SUPER Project: Oak Ridge National Laboratory Team Update: 9/18/2013

Eduardo D'Azevedo, Philip C. Roth, Sarat Sreepathi, Patrick H. Worley (Site Contact)
Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN USA

SUPER All Hands Meeting, Oakland, CA, September 18, 2013

ORNL Update Overview
• The ORNL SUPER team is involved in several activities
– Engagement
– End-to-end/Integration
– Tool Development
– I/O Analysis and Optimization
– Communication Analysis and Optimization
• Personnel
– Lost Gabriel Marin to UT-K in February 2013
– Added Sarat Sreepathi and Ed D'Azevedo

SUPER All-Hands Meeting, September 18-19, 2013

Engagement
• Managed SUPER engagement with Science Application Partnership (SAP) projects
• Participated in 4 SAP projects (with SAP funding)
• FY13 SAP engagement activities and highlights
– FES: Plasma Surface Interactions (PSI)
• Performance analysis of SOLPS, an existing plasma surface interactions code; began optimization by porting parts to an accelerator
• Provided guidance on the design of a lightweight performance data collection infrastructure within XOLOTL, a plasma surface interaction code being developed as part of the PSI project

Engagement
• FY13 SAP engagement activities and highlights
– FES: Center for Edge Physics Simulation (EPSi)
• Contributed to a 4X performance improvement on Titan (and to smaller improvements on other systems)
• Documented performance for INCITE, ERCAP, highlights, ... on Edison, Mira, and Titan; results will be presented in an SC13 poster
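The kind of lightweight, region-based performance data collection mentioned above (e.g., for XOLOTL) can be sketched in a few lines. This is a hedged illustration only: all names here (`region`, `capture_summary`) are hypothetical, not the actual XOLOTL or SUPER data-capture APIs.

```python
import time
from collections import defaultdict

# Hypothetical sketch: wrap named code regions, accumulate per-region
# call counts and elapsed time, and expose the results as a plain dict
# that an automatic collection pipeline could harvest.
_stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

class region:
    """Context manager that times one named code region."""
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        self.t0 = time.perf_counter()
        return self
    def __exit__(self, exc_type, exc, tb):
        rec = _stats[self.name]
        rec["calls"] += 1
        rec["seconds"] += time.perf_counter() - self.t0
        return False  # never swallow exceptions

def capture_summary():
    """Return the captured data as an ordinary dict, ready to serialize."""
    return {name: dict(rec) for name, rec in _stats.items()}

# Time a stand-in "solver step" twice.
for _ in range(2):
    with region("solver_step"):
        sum(i * i for i in range(1000))
```

The appeal of this style is that instrumentation stays in the application's own source, so data capture can run on every production job rather than only in dedicated profiling runs.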
• Continued development and maintenance of infrastructure for automatic performance data capture
• Developed "performance variability" fault tolerance infrastructure
• Packaged and maintained a benchmark version of the code for SUPER, plus ongoing support
• Ongoing evaluation of new code and new problem sizes; contributing to planning

EPSi Highlights
[Figure slide]

Engagement
• FY13 SAP engagement activities and highlights
– BER: Predicting Ice Sheet and Climate Evolution at Extreme Scales (PISCEES)
• Evaluated performance of the "old" model, identifying how to more than double performance on Titan
• Evaluated performance of one of the "new" models, identifying the need for more work in the new solver (ML)
• Developed infrastructure for collecting performance data from Trilinos, compatible with the existing performance data infrastructure
• Ongoing development of infrastructure for including performance tests as part of the V&V test suite

PISCEES Highlights
[Figures: SEACISM Performance, 5km GIS (No I/O) — Timesteps per Day vs. Processor Cores on a Cray XK7 (one sixteen-core processor per node, all cores used), comparing GNU/PGI compilers and overlap settings with evolving temperature against the June 2012 constant-temperature PGI result; SEACISM Performance, 5km GIS with evolving temperature — Timesteps per Day vs. Allocated Processor Cores on a Cray XK7 using every other core per node and an overlap of 0, No IO (GNU more opt., GNU, PGI); FELIX Ice Sheet Model Dynamical Core Performance — Seconds for Nonlinear System Solution for a steady-state solution on a 2km Greenland Ice Sheet grid, Cray XE6 (two 12-core AMD processors per node), comparing ILU/ML, ILU/Belos, and ILU/AztecOO preconditioners]

Engagement
• FY13 SAP engagement activities and highlights
– BER: Applying Computationally Efficient Schemes for BioGeochemical Cycles (ACES4BGC)
• Continued development and maintenance of infrastructure for automatic performance data capture for the Community Earth System Model (many versions; a moving target)
• 1.6X performance improvement for a suite of production runs
• 2X performance improvement for a (different) suite of production runs
• Packaged and maintained the MOAB kernel (from Tim Tautges) for SUPER

Engagement
• FY13 SAP engagement activities and highlights
– BER: Multiscale Methods for Accurate, Efficient, and Scale-Aware Models of the Earth System (no SAP funding)
• Made infrastructure for collecting performance data from Trilinos available to the group developing implicit discretizations of the atmosphere model
• Contributing to ongoing performance evaluation and optimization

End-to-End/Integration
• Continued development of the data collection infrastructure and "marketed" it to SAP projects, primarily in FES and BER so far
• Collaborated with LLNL and Univ.
of Oregon on analyzing the data being collected (see LLNL slides)
• Leveraged the infrastructure in developing performance variability "aware" simulation control logic for EPSi; currently designing a similar capability for ACES4BGC
• Extending the infrastructure to support the I/O analysis and optimization activity (new activity)
• Extending the infrastructure to support the communication analysis and optimization activity (new activity)

Application Characterization (mpiP)
• Oxbow application characterization activity
• Enhanced mpiP with the ability to capture communication topology for point-to-point and for collective communication operations
• Paper describing Oxbow application characterization submitted to the Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computer Systems (PMBS13) (in review)
[Figures: Message sizes, HPCC, selected phases — number of messages vs. log_2(message size bin upper bound, exclusive) for MPIRandomAccess, PTRANS, MPIFFT, and HPL; HPCC MPIRandomAccess communication topology, visualized using VisIt]

Application Analysis: Value Influence Tracking
• Understanding how values are propagated through time and space can help us recognize
– Inefficient/unnecessary computation (e.g., cut-off distance)
– Incorrect computation (e.g., this value should have been accessed)
– Values for which high reliability is needed
• Value influence tracking is a direct, on-line approach for tracking how a value contributes to later computation (its influence) in multithreaded and MPI programs
• The VIT tool, based on Intel Pin and the PMPI profiling interface, implements this approach
– Dynamic instrumentation associated with individual instructions propagates influence data
– PMPI versions of MPI functions propagate influence data between address spaces
• P.C. Roth, "Tracking a Value's Influence on Later Computation," 2013 Workshop on Productivity and Performance (PROPER 2013), Aachen, Germany, August 2013.
[Figures: influence propagation through an operation — inputs u: 0.3 and v: 0.5, operation u + v, influence operator "average", giving dest: 0.4; influence propagation between address spaces — Task 0's main thread calls MPI_Send(A, ...) on A[0]: 0.3, A[1]: 0.7, A[2]: 0.4, ..., the PMPI wrapper sends A along with its influence data IA[0], IA[1], IA[2], ..., and Task 1's MPI_Recv(B, ...) receives the values into B[0]: 0.3, B[1]: 0.7, B[2]: 0.4, ...]

Value Influence Tracking: Example
• 2D heat transfer application (5-point stencil)
[Figure: value influence propagation, starting with a value on the left boundary; cells colored according to the magnitude of influence associated with each data point]

I/O Characterization and Optimization
• New activity (sort of)
– Builds on work from the SciDAC-2 Petascale Data Storage Institute and on work with mpiP
– Leverages the end-to-end data collection infrastructure
– Motivated by a number of different engagement activities
• Personnel: Roth (lead), Sreepathi, Worley
• Research directions
– Techniques for monitoring application I/O behavior and correlating it with system activity to diagnose causes of observed I/O performance variability
– I/O scheduling techniques to reduce conflicts
– On-line auto-tuning of I/O-related parameters

Communication Optimization
• New activity
– Also leverages the end-to-end data collection infrastructure
– Also motivated by a number of different engagement activities
• Personnel: D'Azevedo (lead), Roth, Worley
• Research directions
– Runtime techniques for process placement that minimize communication overhead, based (initially) on offline communication characterization and online information about the allocated nodes and network topology, with special attention to
• applications with multiple communication phases, each with different mapping preferences
• coupled models in which the mapping must take into account dependencies between components, but for which multiple components can be mapped to the same nodes

Communication Optimization: 3mm DIII-D and 5.5mm ITER on Titan
• XGC1 uses a logical 2D processor grid. Different phases of the code prefer different mappings of the process grid to physical nodes and cores, and the importance increases when multiple processes are assigned to a single node. These "trends" are also sensitive to the nodes allocated for a given run, which is not currently controllable on Titan (or on Hopper or Edison).
• The optimal mapping is different for the 5.5mm ITER grid than for DIII-D. Note the unusual non-power-of-two behavior for dimension 2 major ordering.

Summary
• The ORNL SUPER team is actively involved in
– Engagement
– End-to-end/Integration
– Application Characterization/Analysis/Optimization
and the development of enabling techniques and tools

For more information, contact
• Phil Roth at rothpc@ornl.gov
• Pat Worley at worleyph@ornl.gov
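As a closing illustration, the sensitivity of an XGC1-style logical 2D process grid to rank ordering can be made concrete with a small counting sketch. This is a hypothetical example (invented grid and node sizes, not XGC1's actual mapping logic): it counts how many nearest-neighbor pairs along dimension 1 land on different nodes, since off-node links are the expensive ones.

```python
# Hypothetical sketch: lay a logical dim1 x dim2 process grid onto nodes
# with a fixed core count, under two rank orderings, and count the
# dimension-1 neighbor links that cross a node boundary.

def node_of(rank, cores_per_node):
    """Node index under a block assignment of consecutive ranks to nodes."""
    return rank // cores_per_node

def rank_of(i, j, dim1, dim2, dim2_major):
    """Rank of grid point (i, j) under one of two orderings."""
    return i * dim2 + j if dim2_major else j * dim1 + i

def offnode_links_dim1(dim1, dim2, cores_per_node, dim2_major):
    """Count links (i, j) <-> (i+1, j) whose endpoints sit on different nodes."""
    count = 0
    for i in range(dim1 - 1):
        for j in range(dim2):
            a = rank_of(i, j, dim1, dim2, dim2_major)
            b = rank_of(i + 1, j, dim1, dim2, dim2_major)
            if node_of(a, cores_per_node) != node_of(b, cores_per_node):
                count += 1
    return count

# 8 x 8 grid on 16-core nodes: dimension-2-major ordering strands many
# dimension-1 neighbor pairs on different nodes; dimension-1-major keeps
# all of them on-node for this particular communication pattern.
print(offnode_links_dim1(8, 8, 16, dim2_major=True))   # 24 off-node links
print(offnode_links_dim1(8, 8, 16, dim2_major=False))  # 0 off-node links
```

A phase that communicates along dimension 2 instead would invert this preference, which is exactly why a single static mapping cannot serve codes with multiple communication phases.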