DECOUPLED EXECUTION PARADIGM FOR DATA-INTENSIVE HIGH-END COMPUTING

Yong Chen
Data-Intensive Scalable Computing Laboratory
Department of Computer Science
Texas Tech University
11/15/12

About Me
• Assistant Professor; director and faculty member of the Data-Intensive Scalable Computing Laboratory (DISCL)
• Research focus: data-intensive computing, parallel and distributed computing, high-performance computing, cloud computing, computer architectures, and systems software support for high-performance scientific computing and high-end enterprise computing

Wordle of My Current Publication Titles
[Figure: word cloud of current publication titles. Acknowledgment: http://www.wordle.net/]

Outline
• Introduction and Background
• Decoupled Execution Paradigm
• Theoretic Modeling and Analysis
• Data Dependence and Resource Contention
• Evaluations
• Conclusion

High-End Computing / High-Performance Computing
• A form of parallel computing, with a focus on performance
  • The fundamental limits of serial computers are being approached
• A strategic tool for scientific discovery and innovation
  • Solves "grand challenge" problems
  • Helps understand the phenomena behind data
  • Computer simulation and analysis complement theory and experiments

A Typical HEC System: ANL Intrepid (IBM Blue Gene/P Architecture)
• Chip: 4 processors at 850 MHz, 8 MB EDRAM — 13.6 GF/s
• Compute Card: 1 chip, 20 DRAMs — 13.6 GF/s, 2.0 GB DDR, supports 4-way SMP
• Node Card: 32 chips (4x4x2), 32 compute and 0-2 I/O cards — 435 GF/s, 64 GB
• Rack: 32 Node Cards, 1,024 chips, 4,096 processors — 14 TF/s, 2 TB
• Petaflops System: 72 racks, cabled 8x8x16 — 1 PF/s, 144 TB
• Maximum System: 256 racks — 3.5 PF/s, 512 TB
• Front End Node / Service Node: System p servers, Linux SLES10
• HPC software: compilers, GPFS, ESSL, LoadLeveler
(Source: ANL ALCF; data not latest)

Scientific Applications Trend
• Applications tend to be data intensive
  • Scientific simulations, data mining, large-scale data processing, etc.
• A GTC run on 29K cores on the Jaguar machine at OLCF generated over 54 terabytes of data in a 24-hour period

Data requirements for selected INCITE applications at ALCF (Source: R. Ross et al., Argonne National Laboratory)
PI                     | Project                                                                | On-Line Data | Off-Line Data
Lamb, Don              | FLASH: Buoyancy-Driven Turbulent Nuclear Burning                       | 75 TB        | 300 TB
Fischer, Paul          | Reactor Core Hydrodynamics                                             | 2 TB         | 5 TB
Dean, David            | Computational Nuclear Structure                                        | 4 TB         | 40 TB
Baker, David           | Computational Protein Structure                                        | 1 TB         | 2 TB
Worley, Patrick H.     | Performance Evaluation and Analysis                                    | 1 TB         | 1 TB
Wolverton, Christopher | Kinetics and Thermodynamics of Metal and Complex Hydride Nanoparticles | 5 TB         | 100 TB
Washington, Warren     | Climate Science                                                        | 10 TB        | 345 TB
Tsigelny, Igor         | Parkinson's Disease                                                    | 2.5 TB       | 50 TB
Tang, William          | Plasma Microturbulence                                                 | 2 TB         | 10 TB
Sugar, Robert          | Lattice QCD                                                            | 1 TB         | 44 TB
Siegel, Andrew         | Thermal Striping in Sodium Cooled Reactors                             | 4 TB         | 8 TB
Roux, Benoit           | Gating Mechanisms of Membrane Proteins                                 | 10 TB        | 10 TB

Atmospheric Science
• A huge number of sensors are deployed across the world
• They record data every 3 hours
[Figure: sensor deployment across the world (from NOAA, the National Oceanic and Atmospheric Administration)]

Execution Paradigm of High-End Computing: State of the Art
• Current HEC execution models and their associated runtime systems, however, are computing-centric
  • Systems architecture
  • Programming model (e.g., the Message Passing Interface, MPI)
  • Runtime (e.g., the MPI library)
[Figure: abstracted HEC system — compute nodes connected through an interconnect to storage nodes, with data moving between the two sides]
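To make the computing-centric pattern concrete, the following minimal MPI sketch (not taken from the slides; the file name, data sizes, and the trivial reduction are illustrative assumptions) shows how a conventional application first pulls all raw data across the interconnect into compute-node memory before any computation happens.

    /* Conventional computing-centric pattern (sketch only).
     * "temperature.dat" and N_PER_RANK are illustrative. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N_PER_RANK (1 << 20)   /* elements read per rank (assumed) */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        MPI_Offset off;
        double *buf, local_sum = 0.0, global_sum = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(N_PER_RANK * sizeof(double));

        /* Step 1: move raw data from storage into compute-node memory. */
        MPI_File_open(MPI_COMM_WORLD, "temperature.dat",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        off = (MPI_Offset)rank * N_PER_RANK * sizeof(double);
        MPI_File_read_at_all(fh, off, buf, N_PER_RANK, MPI_DOUBLE,
                             MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        /* Step 2: compute on in-memory data (MPI's strength). */
        for (int i = 0; i < N_PER_RANK; i++)
            local_sum += buf[i];

        /* Step 3: exchange small in-memory results among compute nodes. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Every byte read in step 1 crosses the interconnect, even when only a small derived quantity is needed downstream; this imbalance is what the following slides target.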
Execution Paradigm of High-End Computing: State of the Art (cont.)
• Not ready to support efficient I/O for data-intensive HEC
  • MPI focuses on exchanging in-memory data
  • HEC performance is commonly measured by the peak performance of small computation kernels that fit well into memory and cache
• The data-driven IT industry has developed a new paradigm, MapReduce, for its needs
• There is a great need for the HEC community to rethink its execution models for the coming data-intensive HEC era

Decoupled Execution Paradigm
• We propose a new Decoupled Execution Paradigm (DEP) for data-intensive high-end computing
• Introduces the notion of separating compute nodes and data (processing) nodes
  • Decouples execution into computation-intensive and data-intensive operations
  • Data nodes handle data-intensive operations collectively
  • Compute nodes handle computation-intensive operations collectively
• With this collective support, an application is executed in a decoupled but fundamentally more efficient manner for data-intensive HEC
• A rethinking of the execution paradigm in which I/O-intensive operations are as important as computation
• Provides balanced computation and data-access capabilities
• Preliminary results have shown promise and potential

Y. Chen, C. Chen, X.-H. Sun, W. D. Gropp, and R. Thakur. A Decoupled Execution Paradigm for Data-Intensive High-End Computing. In Proc. of the IEEE International Conference on Cluster Computing 2012 (Cluster'12), 2012.

Motivating Example
• Data are commonly represented by a multi-dimensional array-based data model
• Read the required data from storage servers to compute nodes
• Perform computations on the desired data under specified conditions, and then write the results back
• The workflow has clear data retrieval and processing phases as well as computing and simulation phases
• Data access and movement often dominate execution time for data-intensive HEC applications
[Figure: processing 3-dimensional temperature data in the Community Earth System Model (CESM)]

Decoupled Execution Paradigm Design
[Figure: DEP software stack — applications run on the Decoupled Execution Programming Model (DEPM, an MPI extension) over the decoupled execution systems architecture; compute nodes (with the message passing library) and compute-side/storage-side data nodes (with local SSD storage and the data processing library) are connected by an interconnect and managed by the Decoupled Execution Runtime System (DERS)]

DEP System Architecture
• Nodes are decoupled into compute nodes, compute-side data nodes, and storage-side data nodes
  • Compute-side data nodes reduce the size of computation-generated data before sending it to storage nodes
  • Storage-side data nodes reduce the size of data retrieved from storage before sending it to compute nodes
• Data nodes can provide simple data forwarding, but
  • The idea is to conduct decoupled data-intensive operations and optimizations that reduce data size and movement
• Compute nodes handle computation-intensive operations collectively

DEP Programming Model
• Determines which operations are passed to data nodes
• Designed as an MPI extension, allowing users to specify operations conducted on data nodes
  • Results are sent back to compute nodes for further processing
• Similar to the netCDF Operators, but allows data-intensive operations to be decoupled and processed on data nodes
  • For instance, an ncwa operator computes the weighted average of specified data and returns the result for further computation
• Extended and more powerful
  • Allows operations, not only operators, to be decoupled
  • Allows optimizations across operations
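As a purely hypothetical illustration (the slides do not give the actual interface), a DEP-style MPI extension might be used roughly as follows. The DEP_Reduce_on_data_nodes name, its signature, and the fallback body are invented for this sketch so that it compiles; a real DEP runtime would ship the named operation to the data processing library on the storage-side data nodes.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Invented prototype: run a named data-intensive operation over a file
     * region on the data nodes and return only the (small) result. */
    int DEP_Reduce_on_data_nodes(const char *path, const char *op,
                                 MPI_Offset offset, int nelems,
                                 double *result, MPI_Comm comm);

    /* Stand-in body so the sketch compiles: it simply does the
     * read-and-reduce locally on the calling compute node. */
    int DEP_Reduce_on_data_nodes(const char *path, const char *op,
                                 MPI_Offset offset, int nelems,
                                 double *result, MPI_Comm comm)
    {
        (void)op; (void)comm;       /* a real runtime would dispatch on these */
        MPI_File fh;
        double *buf = malloc((size_t)nelems * sizeof(double));
        double sum = 0.0;

        MPI_File_open(MPI_COMM_SELF, path, MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        MPI_File_read_at(fh, offset, buf, nelems, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        for (int i = 0; i < nelems; i++)
            sum += buf[i];
        *result = sum / nelems;     /* unweighted mean as a placeholder */
        free(buf);
        return MPI_SUCCESS;
    }

    int main(int argc, char **argv)
    {
        int rank;
        double avg = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            /* Under DEP, only the reduced value would cross the
             * interconnect; the raw data stays on the storage side. */
            DEP_Reduce_on_data_nodes("temperature.dat", "weighted_average",
                                     0, 1 << 20, &avg, MPI_COMM_WORLD);
            printf("average = %f\n", avg);
        }
        MPI_Finalize();
        return 0;
    }

In the intended decoupled execution, the body of such a call runs on the storage-side data nodes and only the reduced result travels back to the compute nodes, which is the behavior the runtime system described next has to provide.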
DEP Runtime System
• Relies on two libraries: a message passing library and a data processing library
• The message passing library focuses on the memory abstraction and provides support for computation-intensive operations
  • Leverages the existing MPI library for this purpose
• The data processing library focuses on the I/O abstraction and provides support for data-intensive operations
• The two libraries are tightly coupled, and the message passing library manages the interaction between them
• Can also optimize user-defined data-intensive operations and other I/O optimization operations on data nodes

Comparison of Execution Paradigms
[Figure: conventional execution paradigm (retrieval, compute, reduce, store, with the retrieval/store path as the bottleneck) versus the decoupled execution paradigm (retrieval and reduction near storage, then compute and store), yielding reduced latency, improved access, and reduced data movement and network transmission]

Data Dependence and Dynamic Data Distribution
• Data dependence is caused by two factors:
  • Data access patterns of operations
  • Data distribution in the file system
• Dynamic data distribution is proposed to address it
[Figure: terrain-map example with single-flow-direction (SFD) and multiple-flow-direction (MFD) operations over latitude/longitude strips, and possible strip-based data distributions across servers a, b, and c, each running an analysis kernel over its local disk]

C. Chen and Y. Chen. Dynamic Active Storage for High Performance I/O. In The 41st International Conference on Parallel Processing (ICPP'12), 2012.

Resource Contention and Solution
• An HEC system may run dozens of applications simultaneously
• Resource contention degrades overall system performance
• Dynamic operation scheduling is proposed
[Figure: m applications (m < n) issuing normal I/O (NI) and active I/O (AI) requests of 256 MB each into the I/O queues of the data nodes; the accompanying chart compares execution time (s) of active storage (AS) and the proposed scheduling (TS) for 3, 5, and 7 I/Os per storage node]

C. Chen, Y. Chen, and P. C. Roth. DOSAS: Mitigating the Resource Contention in Active Storage Systems. Accepted to appear in the IEEE International Conference on Cluster Computing 2012.

Preliminary Results and Experimental Platform
• Experimental platform
  • A 640-node Linux cluster
  • Each node is equipped with Intel(R) Xeon(R) 2.8 GHz CPUs (12 cores per node) and 24 GB memory
• Two application kernels
  • A kernel calculation of the CESM that computes the moving average of a selected area of specified data
  • Flow routing and flow accumulation calculations in a geographic information system
[Figure: flow directions in a grid of terrains — numbers represent the gradient of each terrain; arrows represent the direction of water flow]
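For concreteness, here is a minimal sketch of the kind of moving-average reduction the CESM kernel performs and that a storage-side data node could apply before shipping data to compute nodes. The window size, array dimensions, and layout are assumptions; the slides only state that the kernel computes the moving average of a selected area of specified data.

    /* Block-averaged moving average over a selected 3-D temperature slab.
     * All sizes (NT, NY, NX, WIN) are illustrative assumptions. */
    #include <stdio.h>

    #define NT 8     /* time steps in the selected slab (assumed)  */
    #define NY 4     /* latitude points (assumed)                  */
    #define NX 4     /* longitude points (assumed)                 */
    #define WIN 4    /* averaging window along time (assumed)      */

    /* Average WIN consecutive time steps for every (y, x) cell, so only
     * NT/WIN * NY * NX values need to leave the data node instead of
     * NT * NY * NX raw values. */
    static void block_average(const double in[NT][NY][NX],
                              double out[NT / WIN][NY][NX])
    {
        for (int t = 0; t < NT / WIN; t++)
            for (int y = 0; y < NY; y++)
                for (int x = 0; x < NX; x++) {
                    double s = 0.0;
                    for (int w = 0; w < WIN; w++)
                        s += in[t * WIN + w][y][x];
                    out[t][y][x] = s / WIN;
                }
    }

    int main(void)
    {
        static double temp[NT][NY][NX];
        static double avg[NT / WIN][NY][NX];

        /* Fill with synthetic temperatures for the demonstration. */
        for (int t = 0; t < NT; t++)
            for (int y = 0; y < NY; y++)
                for (int x = 0; x < NX; x++)
                    temp[t][y][x] = 280.0 + t + 0.1 * y + 0.01 * x;

        block_average(temp, avg);
        printf("reduced %d values to %d (factor %d)\n",
               NT * NY * NX, (NT / WIN) * NY * NX, WIN);
        return 0;
    }

Running such a reduction on the data nodes shrinks the data crossing the network by roughly the window factor, which is the effect the following measurements quantify.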
Results of the CESM Kernel Code
[Figure: execution time (s) of the CESM kernel code, conventional vs. DEP, with different data set sizes (GB) on 48 nodes and on 96 nodes]

Results of the CESM Kernel Code (cont.)
[Figure: effective bandwidth (MB/s) of the CESM kernel code, conventional vs. DEP, with different data set sizes (GB) on 96 nodes, with 24 and with 48 storage-side data nodes]

Results of the GIS Kernel Code
[Figure: effective bandwidth (MB/s) of the flow routing and accumulation code, conventional vs. DEP, with different data set sizes (GB) on 24 nodes and on 48 nodes]

Related Work and Comparison
• Extensive studies have focused on improving the performance of data-intensive HEC at various levels
• We compare with three levels of related work
  • The architecture, programming model, and runtime system levels
• Architecture improvements for data-intensive HEC
  • Nonvolatile storage-class memory devices are promising but cannot reduce the data movement across the network
  • Active storage and active disks offload computations but are designed for either idle computing power or an embedded processor
    • DEP provides a more powerful platform for the same purpose
  • I/O forwarding and data shipping offload I/O requests too; data nodes in the DEP design can carry out all these functions and more

Related Work and Comparison (cont.)
• Programming model improvements for data-intensive HEC
  • Current models are designed for computation-intensive applications
    • They include MPI, Global Arrays, Unified Parallel C, Chapel, X10, Coarray Fortran, and HPF
    • I/O is handled through a subset of interfaces such as MPI-IO
  • MapReduce was an instant hit, but it is typically layered on top of distributed file systems and is not designed for high-performance semantics
• Runtime system improvements for data-intensive HEC
  • Advanced I/O libraries: HDF, PnetCDF, ADIOS
  • Collective I/O, data sieving, server-directed I/O, disk-directed I/O
  • Caching, buffering, staging, and prefetching optimization strategies
  • Data nodes in DEP work for both reads and writes and can provide buffering or staging, but they focus more importantly on reduction

Conclusion and Future Work
• I/O has been widely recognized as a bottleneck in high-end computing for data-intensive scientific discovery
  • This bottleneck and the massive amount of data movement can largely limit the productivity of data-intensive sciences
• Contributions of this research
  • Studies a decoupled execution paradigm (DEP) for data-intensive high-end computing
  • Separates data-processing nodes from compute nodes, decomposes application operations, and maps them onto the decoupled nodes
  • Data nodes and compute nodes collectively provide a balanced design
  • Verified with an initial prototype; the results are promising
• This is an initial step in trying a new execution paradigm
  • We are continuing to work on each component

Thank You
U-Reason Seminar, 11/15/12
For more information, please visit: http://data.cs.ttu.edu/dep