Connecting HPIO Capabilities with Domain-Specific Needs
Rob Ross, MCS Division, Argonne National Laboratory
rross@mcs.anl.gov

I/O in an HPC system
[Diagram: clients running applications (100s-1000s), connected over a storage or system network to I/O devices or servers (10s-100s); I/O system software runs on both sides, above the storage hardware]
• Many cooperating tasks share I/O resources
• Performance relies on parallelism in both hardware and software

Motivation
• HPC applications increasingly rely on I/O subsystems
  – Large input datasets, checkpointing, visualization
• Applications continue to be scaled up, putting more pressure on I/O subsystems
• Application programmers want interfaces that match their domain
  – Multidimensional arrays, typed data, portable formats
• Two issues must be resolved by the I/O system
  – Very high performance requirements
  – The gap between application abstractions and hardware abstractions

I/O history in a nutshell
• I/O hardware has lagged, and continues to lag, behind other system components
• I/O software has matured more slowly than other components (e.g., message passing libraries)
  – Parallel file systems (PFSs) are not enough
• This combination has led to poor I/O performance on most HPC platforms
• Only in a few instances have I/O libraries presented abstractions matching application needs

Evolution of I/O software
(Not to scale or necessarily in the right order…)
• The goal is convenience and performance for HPC
• Capabilities have emerged slowly
• Parallel high-level libraries bring together good abstractions and performance (maybe)

I/O software stacks
[Stack diagram: Application → High-level I/O Library → MPI-IO Library → Parallel File System → I/O Hardware]
• Myriad I/O components are converging into layered solutions
• Layers insulate applications from eccentric MPI-IO and PFS details
• Maintain (most of) the I/O performance
  – Some high-level library features do cost performance

Role of parallel file systems
• Manage storage hardware
  – Lots of independent components
  – Must present a single view
  – Provide fault tolerance
• Focus on concurrent, independent access
  – Difficult to pass knowledge of collectives down to the PFS
• Scale to many clients
  – Probably means removing all shared state
  – Lock-free approaches
• Publish an interface that MPI-IO can use effectively
  – Not POSIX

Role of MPI-IO implementations
• Facilitate concurrent access by groups of processes
  – Understanding of the programming model
• Provide hooks for tuning the PFS
  – MPI_Info as the interface to PFS tuning parameters (see the hint sketch at the end of this section)
• Expose a fairly generic interface
  – Good for building other libraries
• Leverage MPI-IO semantics
  – Aggregation of I/O operations
• Hide unimportant details of the parallel file system

Role of high-level libraries
• Provide an appropriate abstraction for the domain
  – Multidimensional, typed datasets
  – Attributes
  – Consistency semantics that match usage
  – Portable format
• Maintain the scalability of MPI-IO (see the sketch below)
  – Map data abstractions to MPI datatypes
  – Encourage collective I/O
• Implement optimizations that MPI-IO cannot (e.g., header caching)
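As a concrete illustration of this mapping, the following is a minimal sketch of a collective write through the PnetCDF C API. The file name, variable name, and block sizes are illustrative assumptions, not taken from any particular application; error checking is omitted.

    /* Each of N processes writes one slab of a 3-D variable collectively. */
    #include <stdlib.h>
    #include <mpi.h>
    #include <pnetcdf.h>

    #define NX 16   /* local block extents (illustrative) */
    #define NY 16
    #define NZ 16

    int main(int argc, char **argv)
    {
        int rank, nprocs, ncid, varid, dimids[3];
        MPI_Offset start[3], count[3];
        float *local;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Define a global array: nprocs blocks stacked along the first dimension. */
        ncmpi_create(MPI_COMM_WORLD, "flash_var.nc",
                     NC_CLOBBER | NC_64BIT_OFFSET, MPI_INFO_NULL, &ncid);
        ncmpi_def_dim(ncid, "x", (MPI_Offset)nprocs * NX, &dimids[0]);
        ncmpi_def_dim(ncid, "y", NY, &dimids[1]);
        ncmpi_def_dim(ncid, "z", NZ, &dimids[2]);
        ncmpi_def_var(ncid, "density", NC_FLOAT, 3, dimids, &varid);
        ncmpi_enddef(ncid);

        /* Each rank owns one NX x NY x NZ slab (no ghost cells written). */
        local = malloc(sizeof(float) * NX * NY * NZ);
        for (int i = 0; i < NX * NY * NZ; i++) local[i] = (float)rank;

        start[0] = (MPI_Offset)rank * NX; start[1] = 0;  start[2] = 0;
        count[0] = NX;                    count[1] = NY; count[2] = NZ;

        /* Collective write: PnetCDF maps the slab description to MPI
         * datatypes and calls MPI-IO collectives underneath. */
        ncmpi_put_vara_float_all(ncid, varid, start, count, local);

        ncmpi_close(ncid);
        free(local);
        MPI_Finalize();
        return 0;
    }

Because the call is collective, the PnetCDF and MPI-IO layers are free to aggregate the per-process slabs into large, well-formed requests to the parallel file system.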
Example: ASCI/Alliance FLASH
[Stack diagram: ASCI FLASH → Parallel netCDF → IBM MPI-IO → GPFS → Storage]
• FLASH is an astrophysics simulation code from the ASCI/Alliance Center for Astrophysical Thermonuclear Flashes
• Fluid dynamics code using adaptive mesh refinement (AMR)
• Runs on systems with thousands of nodes
• Three layers of I/O software sit between the application and the I/O hardware
• Example system: ASCI White Frost

FLASH data and I/O
[Figure: AMR block with a perimeter of ghost cells; each element holds 24 variables]
• 3D AMR blocks
  – 16³ elements per block
  – 24 variables per element
  – Perimeter of ghost cells
• Checkpoint writes all variables
  – No ghost cells
  – One variable at a time (noncontiguous)
• Visualization output is a subset of the variables
• Portability of the data is desirable
  – Postprocessing happens on a separate platform

Tying it all together
[Chart: FLASH I/O Benchmark results on 16-256 processors, comparing HDF5 and PnetCDF output]
• FLASH tells PnetCDF that all of its processes want to write out regions of variables and store them in a portable format
• PnetCDF performs data conversion and calls the appropriate MPI-IO collectives
• MPI-IO optimizes the writes to GPFS using data shipping and I/O agents
• GPFS moves data from the agents to the storage resources, stores the data, and maintains the file metadata
• In this case, PnetCDF is the better match to the application

Future of I/O system software
[Stack diagram: Application → Domain-Specific I/O Library → High-level I/O Library → MPI-IO Library → Parallel File System → I/O Hardware]
• More layers in the I/O stack
  – Better match the application's view of data
  – Map this view to PnetCDF or similar
  – Maintain collectives and rich descriptions
• More high-level libraries using MPI-IO
  – PnetCDF and HDF5 are great starts
  – These should be considered mandatory I/O system software on our machines
• Focus component implementations on their roles
  – Less general-purpose file systems
    - Scalability and APIs of existing PFSs aren't up to the workloads and scales
  – More aggressive MPI-IO implementations
    - Lots can be done if we're not busy working around broken PFSs
  – More aggressive high-level library optimization
    - High-level libraries know the most about what is going on

Future
• Creation and adoption of parallel high-level I/O libraries should make things easier for everyone
  – New domains may need new libraries or new middleware
  – High-level libraries that target database backends seem obvious; someone is probably already doing this
• Further evolution of components is necessary to get the best performance
  – Tuning and extending file systems for HPC (e.g., user metadata storage, better APIs)
• Aggregation, collective I/O, and leveraging semantics are even more important at larger scale
  – Reliability too, especially for kernel-resident file system components
• Potential hardware changes (MEMS, active disks) are complementary
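The hint sketch referenced earlier: a minimal example of passing file-system tuning parameters through MPI_Info when opening a file with MPI-IO. The hint names used here follow ROMIO conventions and are assumptions; an implementation may silently ignore any hint it does not recognize.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Info info;

        MPI_Init(&argc, &argv);

        /* Hints are advisory; names and useful values depend on the
         * MPI-IO implementation and the underlying file system. */
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "enable");   /* use collective buffering on writes */
        MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MB aggregation buffer */

        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* ... collective writes (e.g., MPI_File_write_all) would go here ... */

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }

Because hints are advisory, code like this stays portable even where a given file system or MPI-IO library provides no matching tuning knob.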