Parallel IO in the Community Earth System Model

Jim Edwards (NCAR), John Dennis (NCAR), Ray Loy (ANL), Pat Worley (ORNL)
jedwards@ucar.edu, NCAR, P.O. Box 3000, Boulder, CO 80307-3000, USA
Workshop on Scalable IO in Climate Models, 27/02/2012

[Diagram: the CPL7 coupler connecting the CESM components: CAM (atmospheric model), CLM (land model), CICE (sea ice model), POP2 (ocean model), and CISM (land ice model)]

• Some CESM 1.1 capabilities:
– Ensemble configurations with multiple instances of each component
– Highly scalable capability, proven to 100K+ tasks
– Regionally refined grids
– Data assimilation with DART

Where we started:
• Each model component was independent, with its own IO interface
• A mix of file formats:
– NetCDF
– Binary (POSIX)
– Binary (Fortran)
• A gather-scatter method to interface with serial IO

• First step: converge on a single file format. NetCDF was selected because it is:
– Self-describing
– Lossless, with a lossy capability (NetCDF4 only)
– Compatible with the current postprocessing tool chain

• Requirements for the extension to parallel IO:
– Reduce the single-task memory profile
– Maintain a single-file, decomposition-independent format
– Performance (a secondary issue)

• Parallel IO from all compute tasks is not the best strategy:
– Data rearrangement is complicated, leading to numerous small and inefficient IO operations
– MPI-IO aggregation alone cannot overcome this problem

• PIO goals:
– Reduce per-MPI-task memory usage
– Be easy to use
– Improve performance
• Write and read a single file from a parallel application
• Multiple backend libraries: MPI-IO, NetCDF3, NetCDF4, pNetCDF, NetCDF+VDC
• A meta-IO library: a potential interface to other general-purpose IO libraries

[Diagram: all CESM components (CAM, CLM, CICE, POP2, CISM, CPL7) perform IO through PIO, which sits over the backend stack: netcdf3, netcdf4, pnetcdf, VDC, HDF5, MPI-IO]

PIO design principles:
• Separation of concerns
• Separate computational and I/O decompositions
• Flexible user-level rearrangement
• Encapsulation of expert knowledge
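Before each principle is expanded below, here is how they surface at the user level: a minimal sketch of a distributed write through PIO's Fortran interface. This assumes the PIO1-era API (PIO_init, PIO_initdecomp, PIO_createfile, PIO_write_darray); the field name, sizes, and the choice of one IO task per four compute tasks are illustrative, and exact signatures may differ between PIO versions.

    subroutine write_temperature(my_rank, ntasks, local_t)
      use pio   ! PIO Fortran API (PIO1-era names assumed)
      use mpi
      implicit none
      integer, intent(in) :: my_rank, ntasks   ! MPI already initialized by caller
      real(8), intent(in) :: local_t(:)        ! this task's slice of the field

      type(iosystem_desc_t) :: ios
      type(io_desc_t)       :: iodesc
      type(file_desc_t)     :: file
      type(var_desc_t)      :: vardesc
      integer, parameter :: nglobal = 3600*2400  ! global field size (illustrative)
      integer :: compdof(size(local_t))
      integer :: dimid, ierr, i

      ! Computational decomposition: the global (1-based) offset of each
      ! locally held element; 0 would mark a hole. Here, equal-size blocks.
      do i = 1, size(local_t)
         compdof(i) = my_rank*size(local_t) + i
      end do

      ! One of every 4 tasks acts as an IO task (stride 4); PIO owns the
      ! I/O decomposition and the rearrangement onto it.
      call PIO_init(my_rank, MPI_COMM_WORLD, ntasks/4, 1, 4, PIO_rearr_box, ios)
      call PIO_initdecomp(ios, PIO_double, (/ nglobal /), compdof, iodesc)

      ierr = PIO_createfile(ios, file, PIO_iotype_pnetcdf, 'T.nc', PIO_clobber)
      ierr = PIO_def_dim(file, 'n', nglobal, dimid)
      ierr = PIO_def_var(file, 'T', PIO_double, (/ dimid /), vardesc)
      ierr = PIO_enddef(file)

      ! "I want to write T": the library decides how the bytes reach disk.
      call PIO_write_darray(file, vardesc, iodesc, local_t, ierr)
      call PIO_closefile(file)
      call PIO_finalize(ios, ierr)
    end subroutine write_temperature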
• What versus how:
– The concern of the user: what to write to or read from disk.
• e.g., "I want to write T, V, PS."
– The concern of the library developer: how to access the disk efficiently.
• e.g., "How do I construct I/O operations so that write bandwidth is maximized?"
– This separation improves ease of use, improves robustness, and enables better reuse.

Separate computational and I/O decompositions:
[Diagram: rearrangement between the computational decomposition and the I/O decomposition]

Flexible user-level rearrangement:
• A single technical solution is not suitable for the entire user community:
– User A: Linux cluster, 32-core job, 200 MB files, NFS file system
– User B: Cray XE6, 115,000-core job, 100 GB files, Lustre file system
• Different compute environments require different technical solutions!

Writing distributed data (I):
[Diagram: computational decomposition rearranged onto the I/O decomposition]
+ Maximizes the size of individual IO operations to disk
- Non-scalable user-space buffering
- Very large fan-in, hence very large MPI buffer allocations
The correct solution for User A.

Writing distributed data (II):
+ Scalable user-space memory
+ Relatively large individual IO operations to disk
- Very large fan-in, hence very large MPI buffer allocations

Writing distributed data (III):
+ Scalable user-space memory
+ Smaller fan-in, hence modest MPI buffer allocations
- Smaller individual IO operations to disk
The correct solution for User B.

Encapsulating expert knowledge:
• Flow-control algorithm
• Match the size of IO operations to the file-system stripe size
– Cray XT5/XE6 + Lustre file system
– Minimizes message-passing traffic at the MPI-IO layer
• Load-balance disk traffic over all IO nodes
– IBM Blue Gene/{L,P} + GPFS file system
– Uses Blue Gene-specific topology information

Did we achieve our design goals?
• Impact of PIO features:
– Flow control
– Varying the number of IO tasks
– Different general IO backends
• Read/write a 3D POP-sized variable [3600x2400x40]
• 10 files, 10 variables per file; maximum bandwidth reported
• Run on Kraken (Cray XT5) + Lustre file system, using 16 of 336 OSTs

[Figures: read and write bandwidth for 3D POP arrays [3600x2400x40], shown across flow-control settings, IO-task counts, and IO backends]
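The runs above read and write a field decomposed into horizontal blocks across tasks. As a concrete illustration of the computational-decomposition side of these experiments, the sketch below builds the compdof index mapping that each task would hand to PIO_initdecomp for a 3600x2400x40 field; the 2D task grid (npx x npy) and the assumption that it divides the grid evenly are mine, not taken from the slides.

    ! Build one task's compdof mapping (global 1-based offsets of the
    ! locally owned elements) for a 3600 x 2400 x 40 field distributed
    ! over an npx x npy horizontal task grid, full depth on every task.
    subroutine build_compdof(my_rank, npx, npy, compdof)
      implicit none
      integer, intent(in)               :: my_rank, npx, npy
      integer, allocatable, intent(out) :: compdof(:)

      integer, parameter :: nx = 3600, ny = 2400, nz = 40
      integer :: nxl, nyl, x0, y0, i, j, k, n

      nxl = nx / npx                 ! local block size (assumed to divide evenly)
      nyl = ny / npy
      x0  = mod(my_rank, npx) * nxl  ! this task's block origin, 0-based
      y0  = (my_rank / npx)  * nyl

      allocate(compdof(nxl*nyl*nz))
      n = 0
      do k = 0, nz - 1               ! x fastest, then y, then z, matching the
         do j = 0, nyl - 1           ! ordering of the global array
            do i = 0, nxl - 1
               n = n + 1
               compdof(n) = 1 + (x0 + i) + nx*(y0 + j) + nx*ny*k
            end do
         end do
      end do
    end subroutine build_compdof

How those indices are then binned onto IO tasks, in what message sizes, and with what flow control is exactly the expert knowledge PIO encapsulates.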
PIOVDC: parallel output to a VAPOR Data Collection (VDC)
• VDC: a wavelet-based, gridded data format supporting both progressive access and efficient data subsetting
• Progressive access: data may be read back at different levels of detail, permitting the application to trade off speed and accuracy
– Think Google Earth: less detail when the viewer is far away, progressively more detail as the viewer zooms in
– Enables rapid (interactive) exploration and hypothesis testing that can subsequently be validated with full-fidelity data as needed
• Subsetting: arrays are decomposed into smaller blocks that significantly improve extraction of arbitrarily oriented subarrays
• Wavelet transform:
– Similar to the Fourier transform
– Computationally efficient: O(n)
– The basis for many multimedia compression technologies (e.g., MPEG-4, JPEG 2000)

Other projects using PIO:
• Earth System Modeling Framework (ESMF)
• Model for Prediction Across Scales (MPAS)
• Geophysical High Order Suite for Turbulence (GHOST)
• Data Assimilation Research Testbed (DART)

100:1 compression with coefficient prioritization
[Figure: 1024^3 Taylor-Green turbulence (enstrophy field) [P. Mininni, 2006]; no compression vs. coefficient prioritization (VDC2)]

800:1 compression
[Figure: 4096^3 homogeneous turbulence simulation; volume renderings of the original enstrophy field and the 800:1 compressed field. Original: 275 GB/field; compressed: 0.34 GB/field. Data provided by P.K. Yeung at Georgia Tech and Diego Donzis at Texas A&M]

Interface generation with genf90.pl, the template:

    interface PIO_write_darray
    ! TYPE real,int
    ! DIMS 1,2,3
       module procedure write_darray_{DIMS}d_{TYPE}
    end interface
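Model code only ever sees the generic name; standard Fortran generic resolution picks the generated specific procedure by the rank and type of the actual argument. A fragmentary sketch (the file, variable, and decomposition descriptors are assumed to have been set up as shown earlier):

    real    :: t2d(nxl, nyl)          ! will resolve to write_darray_2d_real
    integer :: mask3d(nxl, nyl, nz)   ! will resolve to write_darray_3d_int

    call PIO_write_darray(file, vardesc_t, iodesc_2d, t2d,    ierr)
    call PIO_write_darray(file, vardesc_m, iodesc_3d, mask3d, ierr)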
Running the template through genf90.pl expands every {DIMS} and {TYPE} combination (the # 1 "tmp.F90.in" line is a preprocessor-style marker pointing back at the template):

    # 1 "tmp.F90.in"
    interface PIO_write_darray
       module procedure write_darray_1d_real
       module procedure write_darray_2d_real
       module procedure write_darray_3d_real
       module procedure write_darray_1d_int
       module procedure write_darray_2d_int
       module procedure write_darray_3d_int
    end interface

• PIO is open source: http://code.google.com/p/parallelio/
• Documentation (doxygen): http://web.ncar.teragrid.org/~dennis/pio_doc/html/

• Thank you.

Backend library trade-offs:

• netCDF3
– Serial
– Easy to implement
– Limited flexibility
• HDF5
– Serial and parallel
– Very flexible
– Difficult to implement
– Difficult to achieve good performance
• netCDF4
– Serial and parallel
– Based on HDF5
– Easy to implement
– Limited flexibility
– Difficult to achieve good performance
• Parallel-netCDF
– Parallel
– Easy to implement
– Limited flexibility
– Difficult to achieve good performance
• MPI-IO
– Parallel
– Very flexible
– Very difficult to implement
– Difficult to achieve good performance
• ADIOS
– Serial and parallel
– Easy to implement
– BP file format: easy to achieve good performance
– All other file formats: difficult to achieve good performance
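Because every backend sits behind the same PIO calls, choosing among these libraries is a runtime configuration rather than a code change: the iotype passed to PIO_createfile selects the backend. A hedged sketch follows; the constant names follow the PIO1 Fortran API, and which of them exist depends on the PIO version and how it was built.

    ! Map a runtime backend choice (e.g. read from a namelist) onto a
    ! PIO iotype constant; the rest of the IO code is unchanged.
    integer function select_iotype(backend) result(iotype)
      use pio
      implicit none
      character(len=*), intent(in) :: backend
      select case (trim(backend))
      case ('pnetcdf')
         iotype = PIO_iotype_pnetcdf    ! Parallel-netCDF: parallel writes, netCDF3 format
      case ('netcdf4p')
         iotype = PIO_iotype_netcdf4p   ! netCDF4/HDF5 with parallel access
      case default
         iotype = PIO_iotype_netcdf     ! serial netCDF3, written via the IO tasks
      end select
    end function select_iotype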