Preparatory Research on Performance Tools for HPC
HCS Research Laboratory
University of Florida
November 21, 2003

Overview
- Background
- Evaluation Criteria
  - Ideal tool
  - Quantitative categories
- Performance Tools
  - Tool descriptions
  - Potential use and modifications
- Conclusions and Future Plans

Background
- High-performance computing
  - Drastic increase in complexity
  - Performance tools cannot keep pace
  - Tools used for performance analysis, to identify bottlenecks, and enable source code and compilation optimizations
- Performance tools for HPC
  - Performance tools are often afterthoughts
  - Performance tools do not sell systems, therefore considered less important by vendors

Background
- UPC
  - Roots = Split-C, AC, Parallel C Preprocessor
    - Takes best aspects of each
  - Parallel extension of ANSI C standard
    - Allows parallel programming in familiar C style
    - Abstracts communication between threads
  - Shared-memory programming model offers many advantages
    - Programs less complex than MPI and others, software more scalable
    - Allows shared data structures between threads
  - Targeted for both shared- and distributed-memory architectures
    - Challenge: achieving optimal performance
  - Implementations for a growing variety of systems
    - CC-NUMA, SMP, Clusters

Background
- SHMEM
  - Single-ended, asynchronous communication library
    - Remote write and read support without involvement or notification of remote CPU
  - “Virtual” shared memory
  - Only supported on Silicon Graphics and Cray systems
  - Direct memory-to-memory copy
  - Uses explicit function calls (i.e. get and put)
  - HCS lab evaluating options for clusters (e.g.
via SCI, QsNet)
  - Low-level functions and subroutines efficiently use hardware circuitry for low-overhead communication

Background
- Importance of performance tools
  - Identify bottlenecks in program
    - Poor mapping of program to architecture
    - Unoptimized code
    - Parallelization areas
    - Compiler inefficiencies
  - Provide insight on how code actually executes on specific architecture
- UPC and SHMEM have limited support
  - TotalView (UPC): Debugging software
  - CrayPat (SHMEM and UPC): Performance profiler
  - Vampirtrace (SHMEM): Performance profiler

Tool Evaluation Criteria – Features
Desirable features for UPC/SHMEM performance tools:
- Ability to profile each thread (and its variables) independently
- Ability to show breakdown of communication with remote threads
  - Frequency of reads/writes to memory on each remote thread
  - Frequency of each remote thread's reads/writes of data on the local thread
- Basic functional and block performance profiling (delay per function, total functions, total time in block)
- Highlight bottlenecks, points of contention
  - Break down computational stalls (blocking for I/O, shared-memory access, data dependencies, etc.)
  - Communication delay on interconnect (SCI, GigE, IBA, QsNet, etc.)
- Real-time profiling
- Compiler independent
- Network independent
- Platform independent

Tool Evaluation Criteria – e.g. QFD Table
Feature                                    Weight
Independent thread profiling
Remote thread communication
Basic functional and block profiling
Breakdown of stalls
Real-time profiling
Communication delay on interconnect
Compiler independent
Network independent
Platform independent
Miscellaneous characteristics

Tools Overview
- UPC tools
  - TotalView = shows code correctness, not performance!
- SHMEM tools
  - CrayPat
  - Vampir and VampirTrace
- General performance tools
  - Vampir and VampirTrace
  - PAPI
  - Perfometer
  - Kojak
  - SvPablo
  - Paradyn
  - TAU

UPC Tools: TotalView
Overview
- Developed by Etnus, LLC
- Version 6.3 (has supported UPC since version 6.1)
- Debugger only, no performance analysis
- Supports UPC, SHMEM, C/C++, Fortran, MPI, OpenMP, and others
- Supports UPC on SGI IRIX, HP Tru64
Features
- Commercial product
- Tests code modifications without recompilation
- Supports independent thread debugging
- Shared variable views
  - On each thread
  - Altogether (e.g. arrays)
- Other basic features
  - Breakpoints
  - Memory debugging
  - Reliable handling of complex code
Desired Enhancements
- Performance profiling of UPC and SHMEM
  - Basic statistics gathering as a starting point
  - Minimize reduction in performance

SHMEM Tools: CrayPat
Overview
- Developed by Cray
- Performance analysis and tracing tool for Cray X1
- Only works on Cray systems, not currently portable
- Provides run-time analysis and profiling of program performance
- Supports Fortran, MPI, MPI2, Pthreads, SHMEM, and UPC
Features
- Provided with Cray systems
- Can provide extreme levels of detail, at the cost of added complexity
- Requires rebuilding the application with CrayPat instrumentation code and libraries
- Replacement for Cray SV1 performance analysis and profiling tools
- Provides direct read access to hardware performance counters
- Allows user to aggregate, display, format, and export collected performance data in various ways
- Provides I/O performance profiling for Fortran, asynchronous, and system call routines
- Command line interface
Desired Enhancements
- Support for architectures other than Cray systems
- User-friendly GUI

Performance Tools: Vampirtrace
Overview
- Developed by Pallas
- Version 4.0
- Supports all platforms that use GNU Compiler Collection to compile C or Fortran code
- Supports Java, C, and Fortran
- Supports MPI, Global Array programming model, and SHMEM
Features
- Commercial product
- Vampirtrace = generates program trace
- Vampir = GUI used
to analyze trace
- Supports multithreaded MPI programs
- Link Vampirtrace library during compilation
- Can also record arbitrary user-defined events
  - Entries to and exits from subroutines
  - Execution of code blocks
- Filtering mechanism to focus on user-defined events and statistics
Desired Enhancements
- Support UPC programming model

Performance Tools: PAPI
Overview
- Developed at Innovative Computing Laboratory at U. of Tennessee, Knoxville
- Version 2.3.4.2 released May 2003
- Monitors computation events using hardware counters available on modern processors
- Available for Windows, Linux, UNIX platforms
Features
- Open source, free download of full version
- Consists of two layers of software
  - Portable Platform Independent Layer: API
  - Platform Specific Layer: interface substrate that allows API to communicate with hardware counters via patched kernel, operating system, or directly
    - Linux systems must have kernel patched with perfctr tool to allow access to hardware counters
- Provides two interfaces
  - High-level interface for simple measurements and purposes
  - Low-level interface for more complex and sophisticated purposes
- Many tools feature optional support for PAPI
  - Additional features available when tool is configured with PAPI support
  - SvPablo, Perfometer, Visual Profiler, among others
- Example metrics: L1 data cache misses, cache line invalidation, floating-point stalls, instructions per second
Desired Enhancements
- Addition of new PAPI metrics that reflect key issues directly relating to UPC/SHMEM
[Figure: PAPI Software Interface. A Trace/Profiling Tool sits above the Portable Platform Independent Layer (High-Level API, Low-Level API), which sits above the Platform Specific Layer (Machine Dependent Substrate, over a Patched Kernel or the Operating System, over the Hardware Performance Counters).]

Performance Tools: Perfometer
Overview
- Developed at U.
of Tennessee, Knoxville
- Version 1.1 released September 12, 2002
- Works with any system with PAPI support
- Requires Java for GUI
- Provides run-time visualization of program performance
- Supports C/C++ programs and has MPI support
Features
- Open source, free download of full version
- Monitors both local and remote applications
  - GUI and backend communicate through ports
- Returns information on processor and executables for each application
- Has alarms that pause program when data monitoring thresholds are reached
- Able to pause and continue program execution
- Requires perfometer() call inserted in program to enable monitoring
- Mark_perfometer() call allows user to change color of graph to see trends of different sections of code
Desired Enhancements
- Support UPC and SHMEM programming models

Performance Tools: Kojak
Overview
- Collaborative research project of U. of Tennessee, Knoxville and Research Centre Juelich (Germany)
- Version 0.99 released Nov. 4, 2003 (3rd release)
- Available for Linux IA-32, IBM Power3/Power4, SGI MIPS, IA-64, Sun SPARC
- Supports MPI 1.2 and OpenMP, as well as uniprocessor applications
Features
- Open source, free download of full version
- No modifications to source code needed
  - EPILOG trace file generated at program run-time
  - OPARI tool (also part of TAU) provides automatic instrumentation
  - Custom modifications can also be made to permit closer examination of arbitrary function calls
- Open trace file format
  - Support for conversion to VAMPIR format for analysis with VAMPIR tools
- EXPERT module provides automatic analysis of EPILOG trace files
  - Tool is geared towards identifying performance problems
  - Range of problems known to EXPERT is flexible and extensible
Desired Enhancements
- Support UPC and SHMEM programming models
[Figure: EXPERT Pre-Defined Monitored Properties]

Performance Tools: SvPablo
Overview
- Developed at U.
of Illinois, Urbana-Champaign
- Version 5.2 released March 2003
- Tool to help developers “tune” their software for better performance and eliminate bottlenecks
- Available for Sun Solaris, SGI IRIX, IBM SP, Compaq Alpha, and Linux
- Supports C, Fortran 77/90, HPF, MPI, and OpenMP
Features
- Open source, free download of full version
- Interactive instrumentation of code via GUI
- Link SvPablo library during compilation
- Provides performance data
  - Traces loops and function calls
  - But does not trace all instructions
- Provides statistical data
  - Counts how many times a function was executed
  - Records execution time of function
- Correlates performance data with source code
- PAPI support
Desired Enhancements
- Support UPC and SHMEM programming models

Performance Tools: Paradyn
Overview
- Developed by U. of Wisconsin, Madison
- Version 4.0 released May 31, 2003
- Visuals include time-plots, bar graphs, and tables
- Available for Solaris (SPARC), Linux (x86), Windows NT and 2000 (x86), and AIX (RS6000)
- Supports Fortran, C/C++, Java, and MPI
Features
- Open source, free download of full version
- No modifications to source or binary
- Dynamic instrumentation with real-time reporting
- Can focus on specific portions of a program and on specific performance parameters
- Records many different performance statistics such as CPU time, send/receive message count and sizes, sync time, and I/O time
- Performance Consultant executes automated performance bottleneck search
  - Hypothesizes main bottleneck of program or chunks of program
  - Bottlenecks classified as CPUbound, ExcessiveSyncWaitingTime, ExcessiveIOBlockingTime, TooManySmallIOOps
Desired Enhancements
- Support UPC and SHMEM programming models

Performance Tools: TAU
Overview
- Developed at U.
of Oregon
- TAU = Tuning and Analysis Utilities
- Portable Profiling Package
- Available for SGI, Origin 2K, IBM SP2, Cray T3E, Sun, Windows 95/98/NT, Linux (x86)
- Supports C/C++, Java, Fortran 77/90, HPF, HPC++, and MPI
Features
- Open source, free download of full version
- Maintains performance data for each thread, context, and node used in parallel, multi-threaded programs
- Captures data for functions, basic blocks
- Three methods of instrumentation
  1. Automatic via TAU Program Database Toolkit
  2. Manually via TAU instrumentation API
  3. Automatic at run-time via tau_run instrumentor
     - DyninstAPI dynamic instrumentation package
- Racy = GUI analyzer used to find bottlenecks
- PAPI support
  - Fast, reliable support
Desired Enhancements
- Support UPC and SHMEM programming models

Conclusions and Future Plans
- Few tools support UPC or SHMEM programming models
- Many performance analysis tools for message-passing programs
- We must bridge this gap
  - TotalView does not analyze performance
  - CrayPat is not a portable tool
  - Bring performance analysis to UPC and SHMEM tools
  - Bring UPC and SHMEM support to performance tools
  - Determine most feasible approach and pursue
  - Focus on key issues at multiple levels: language, mapping, architecture
- Projected milestones/deliverables in proposed two-year project
  - Year 1
    - Comprehensive survey and evaluation of HPC performance tools
    - Investigation of key performance attributes in UPC and SHMEM
    - Investigation of key performance attributes in existing/emerging system architectures
    - Refinement of evaluation criteria and QFD table to identify primary approach
  - Year 2
    - Development of prototype performance tools for HPC
    - Performance benchmarking and optimization on selected system architectures
    - Investigation of usability and productivity achieved with these tools
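
Illustration: QFD-style scoring
The Year 1 plan mentions refining the evaluation criteria and QFD table to identify a primary approach. One way such a table could be turned into a ranking is a weighted average of per-feature scores. The sketch below is only an assumption about how that might look: the feature names are taken from the evaluation-criteria table, but the weights, the score scale, the tool names ("Tool A", "Tool B"), and the function names are all hypothetical placeholders, since the slides leave the Weight column empty and report no scores.

```python
# Hypothetical QFD-style weighted scoring. Feature names come from the
# evaluation-criteria table; every weight and score below is an invented
# placeholder, not a value from the slides or a measurement of a real tool.

# Placeholder weights (assumed scale: 1 = nice to have, 5 = essential).
EXAMPLE_WEIGHTS = {
    "Independent thread profiling": 5,
    "Remote thread communication": 5,
    "Real-time profiling": 2,
    "Platform independent": 3,
}

# Placeholder scores (assumed scale: 0 = absent, 3 = fully supported)
# for two made-up tools.
EXAMPLE_SCORES = {
    "Tool A": {
        "Independent thread profiling": 3,
        "Remote thread communication": 0,
        "Real-time profiling": 3,
        "Platform independent": 1,
    },
    "Tool B": {
        "Independent thread profiling": 2,
        "Remote thread communication": 2,
        "Real-time profiling": 0,
        "Platform independent": 3,
    },
}


def weighted_score(weights, scores):
    """Return the weight-normalized average score for one tool."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[f] * scores[f] for f in weights) / total_weight


def rank_tools(weights, tool_scores):
    """Return (tool, score) pairs sorted from highest to lowest score."""
    ranked = [(name, weighted_score(weights, s)) for name, s in tool_scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_tools(EXAMPLE_WEIGHTS, EXAMPLE_SCORES):
        print(f"{name}: {score:.2f}")
```

Normalizing by the total weight keeps the result on the same scale as the individual scores, so changing the number of features or rescaling the weights does not change how the tools compare.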