PAPI Deployment, Evaluation, and Extensions

Shirley Moore, Daniel Terpstra, Kevin London, and Philip Mucci (PAPI Development Team), University of Tennessee-Knoxville
Patricia J. Teller, Leonardo Salayandia, Alonso Bayona, and Manuel Nieto (PCAT: Performance Counter Assessment Team), University of Texas at El Paso

UGC 2003, Bellevue, WA – June 9-13, 2003


Main Objectives

Provide DoD users with a set of portable tools and accompanying documentation that enables them to easily collect, analyze, and interpret hardware performance data that is highly relevant for analyzing and improving the performance of applications on HPC platforms.


PAPI: Performance Application Programmer Interface

A cross-platform interface to hardware performance counters.

[Architecture diagram: application programmers, tool developers, and advanced users reach the performance monitoring hardware through the PAPI high-level and low-level interfaces, which sit on top of a vendor-specific interface.]

High-level interface
• Routines to start, read, and stop counters
• Operates on a specific list of event counts
• Simple way to obtain performance data

Low-level interface
• Thread-safe
• Fully programmable
• All native events and counting modes
• Callbacks on counter overflow
• SVR4-compatible profiling


PAPI Standard Event Set

A common set of events considered most relevant for application performance analysis, including:
• Cycle and instruction counts
• Functional unit status
• Cache and memory access events
• SMP cache coherence events

Many PAPI events are mapped directly to native platform events; some are derived from two or more native events. Run avail to find out which standard events are available on a given platform.
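To make the interfaces above concrete, the following is a minimal usage sketch of the high-level interface. It assumes the classic PAPI_start_counters/PAPI_stop_counters routines and two preset events (PAPI_TOT_CYC and PAPI_FP_INS) that may or may not be supported on a given platform; run avail to check before relying on them.

#include <stdio.h>
#include "papi.h"

#define N 1000000

int main(void)
{
    int events[2] = { PAPI_TOT_CYC, PAPI_FP_INS };  /* preset events; check with avail */
    long_long counts[2];                            /* 64-bit counter type from papi.h */
    double sum = 0.0;
    int i;

    if (PAPI_start_counters(events, 2) != PAPI_OK) {
        fprintf(stderr, "PAPI_start_counters failed\n");
        return 1;
    }

    for (i = 0; i < N; i++)                         /* region to be measured */
        sum += 0.5 * i;

    if (PAPI_stop_counters(counts, 2) != PAPI_OK) {
        fprintf(stderr, "PAPI_stop_counters failed\n");
        return 1;
    }

    printf("sum                          = %g\n", sum);
    printf("total cycles                 = %lld\n", (long long)counts[0]);
    printf("floating-point instructions  = %lld\n", (long long)counts[1]);
    return 0;
}

The low-level interface supports the same kind of measurement through explicit event sets (PAPI_library_init, PAPI_create_eventset, PAPI_add_event, PAPI_start, PAPI_stop) when threads, native events, or overflow callbacks are needed.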
Deployment: Objectives

• Develop a portable and efficient interface to the performance monitoring hardware on DoD platforms.
• Install and support PAPI software and related tools on DoD HPC Center machines.
• Collaborate with vendors and users to add additional features and extensions.


Deployment: Methodology - 1

• Investigation of statistical sampling to reduce instrumentation overhead
  - PAPI substrate for the HP AlphaServer based on the DADD/DCPI sampling interface
  - 2-3% overhead vs. 30% using counting
  - Investigating a similar approach on other platforms
• Efficient counter allocation algorithm based on bipartite graph matching, which improved allocation on IBM POWER platforms


Deployment: Methodology - 2

Easy-to-use end-user tools are needed for collecting and analyzing PAPI data:
• TAU (Tuning and Analysis Utilities) from the University of Oregon provides automatic instrumentation for profiling and tracing, with profiling based on time and/or PAPI data.
• PAPI is being incorporated into the VAMPIR trace visualization tool.
• The perfometer graphical analysis tool provides real-time display and/or tracefile capture and replay.
• A papirun command-line utility is under development (similar to perfex and ssrun on SGI IRIX).
• The dynaprof dynamic instrumentation tool is under development.


TAU Performance System Architecture

[Architecture diagram; trace output formats shown include Paraver and EPILOG.]


Vampir v3.x: Hardware Counter Data

[Screenshot: Counter Timeline Display]


Perfometer Parallel Interface

[Screenshot of Perfometer's real-time display for a parallel run.]


Deployment: Methodology - 3

Memory utilization extensions allow users to obtain static and dynamic memory utilization information. Routines added to the low-level API (a usage sketch of the second routine follows below):
• PAPI_get_memory_info()
• PAPI_get_dmem_info()
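As referenced above, here is a sketch of querying dynamic memory utilization. The exact signature of PAPI_get_dmem_info() has varied across PAPI releases; this sketch assumes the struct-based form (PAPI_dmem_info_t) of later releases, and the field names shown are assumptions tied to that form.

#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main(void)
{
    PAPI_dmem_info_t dmem;
    void *buf;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI_library_init failed\n");
        return 1;
    }

    /* Allocate some heap memory so the dynamic numbers have something to show. */
    buf = malloc(32 * 1024 * 1024);
    if (buf == NULL)
        return 1;

    if (PAPI_get_dmem_info(&dmem) != PAPI_OK) {   /* struct-based form assumed */
        fprintf(stderr, "PAPI_get_dmem_info failed\n");
        return 1;
    }

    /* Units depend on the PAPI release; consult the documentation. */
    printf("process size : %lld\n", (long long)dmem.size);
    printf("resident set : %lld\n", (long long)dmem.resident);
    printf("page size    : %lld\n", (long long)dmem.pagesize);

    free(buf);
    return 0;
}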
Deployment: Results

• PAPI implementations exist for the following DoD platforms:
  - IBM POWER3/4
  - HP AlphaServer
  - SGI Origin
  - Sun UltraSPARC
  - Cray T3E
  - Linux clusters
• A Cray X1 implementation is underway.
• PAPI is installed at all four MSRCs and at ARSC and MHPCC, with plans to install at additional DCs.
• PAPI is widely used by DoD application developers and the HPCMO benchmarking team.
• Vendors are being encouraged to take over responsibility for implementing and supporting the machine-dependent PAPI substrate.
• More information: http://icl.cs.utk.edu/papi/


Evaluation: Objectives

• Understand and explain the counts obtained for various PAPI metrics.
• Determine reasons why counts may differ from what is expected.
• Calibrate counts, excluding PAPI overhead.
• Work with vendors and/or the PAPI team to fix errors.
• Provide DoD users with information that will allow them to use the collected performance data effectively.


Evaluation: Methodology - 1

1. Micro-benchmark: design and implement a micro-benchmark that facilitates event count prediction.
2. Prediction: predict event counts using tools and/or mathematical models.
3. Data collection 1: collect hardware-reported event counts using PAPI.
4. Data collection 2: collect predicted event counts using a simulator (not always necessary or possible).


Evaluation: Methodology - 2

5. Comparison: compare predicted and hardware-reported event counts.
6. Analysis: analyze results to identify and possibly quantify differences.
7. Alternate approach: when analysis indicates that prediction is not possible, use an alternate means either to verify the accuracy of the reported event count or to demonstrate that the reported count seems reasonable.


Example Findings - 1

• Some hardware-reported event counts mirror expected behavior, e.g., the number of floating-point instructions on the MIPS R10K and R12K.
• Other hardware-reported counts can be calibrated to mirror expected behavior by subtracting the part of the count associated with the interface (overhead or bias error), e.g., the number of load instructions on the MIPS and POWER processors and instructions completed on the POWER3.
• In some cases, compiler optimizations affect event counts, e.g., the number of floating-point instructions on the IBM POWER platforms.


Example Findings - 2

• Very long instruction words can affect event counts; e.g., on the Itanium architecture the numbers of instruction cache misses and instructions retired are inflated by the no-ops used to compose very long instruction words.
• The definition of an event may be non-standard and, thus, the associated performance data may be misleading, e.g., instruction cache hits on the POWER3.
• The complexity of hardware features and the lack of documentation can make it difficult to understand how to tune performance based on information gleaned from event counts, e.g., data prefetching and the hardware TLB walker.


Example Findings - 3

• Although we have not been able to determine the algorithms used for prefetching, the ingenuity of these mechanisms is striking.
• In some cases, more instructions are completed than issued on the R10K.
• The DTLB miss count on the POWER3 varies depending upon the method used to allocate memory (i.e., static, calloc, or malloc).
• Hardware SQRT on the POWER3 is not counted in total floating-point operations unless it is combined with another floating-point operation.
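The calibration examples on the next two slides subtract a fixed PAPI overhead (about 139-141 instructions completed) from the reported counts. A minimal sketch of how such a baseline can be measured and subtracted, assuming the high-level routines; measure_region() and kernel() are hypothetical helpers introduced only for illustration.

#include <stdio.h>
#include "papi.h"

static long_long measure_region(void (*region)(void))
{
    int event[1] = { PAPI_TOT_INS };     /* instructions completed */
    long_long count[1];

    /* Error checking omitted for brevity. */
    PAPI_start_counters(event, 1);
    if (region != NULL)
        region();                        /* NULL means "empty region" */
    PAPI_stop_counters(count, 1);
    return count[0];
}

static void kernel(void)
{
    volatile double sum = 0.0;
    int i;
    for (i = 0; i < 1000; i++)
        sum += 0.5 * i;
}

int main(void)
{
    long_long overhead = measure_region(NULL);     /* baseline: empty region */
    long_long raw      = measure_region(kernel);   /* region of interest */

    printf("raw count        = %lld\n", (long long)raw);
    printf("overhead (bias)  = %lld\n", (long long)overhead);
    printf("calibrated count = %lld\n", (long long)(raw - overhead));
    return 0;
}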
Calibration Example - 1

Instructions completed; PAPI overhead: 139 on the POWER3-II.

C-level instructions    Predicted    Mean reported    Std. dev.    Reported - Predicted
0 (base)                        0              139            0                     139
10                             34              173            0                     139
100                           340              479            0                     139
1000                         3400             3539            0                     139
10000                       34000            34139            0                     139
100000                     340000           340139            0                     139


Calibration Example - 2

Instructions completed; PAPI overhead: 141 for small micro-benchmarks.

C-level instructions    Predicted    Mean reported    Std. dev.    Reported - Predicted
0 (base)                        0              141            0                     141
10                             34              175            0                     141
100                           340              481            0                     141
1000                         3400          3541.04      1.14e-5                  141.04
10000                       34000         34152.57     0.000339                  152.57
100000                     340000         340267.9     0.000373                  267.90


RIB/OKC for Evaluation Resources

• Object-oriented data model to store benchmarks, results, and analyses
• Information organized for ease of use by colleagues external to PCAT
• To be web-accessible to members
• Objects are linked to one another as appropriate:
  - Benchmark: general description of a benchmark
  - Case: specific implementation and results
  - Machine: description of a platform
  - Organization: contact information


PCAT RIB/OKC: Data Repository Example

Benchmark name: DTLB misses
Development date: 12/2002
Benchmark type: Array
Abstract: The code traverses through an array of integers once at regular strides of PAGESIZE. The intention is to create compulsory misses on each array access. Input parameters are page size (bytes) and array size (bytes). The number of misses normally expected is Array Size / Page Size.
Files included: dtlbmiss.c, dtlbmiss.pl
About included files:
• dtlbmiss.c: benchmark source code in C; takes pagesize and arraysize parameters as input and outputs the PAPI event count.
• dtlbmiss.pl: Perl script that executes the benchmark 100 times for increasing arraysize parameters and saves the benchmark output to a text file. The script should be customized for the pagesize parameter and the arraysize range.
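For illustration, a sketch of what a DTLB-miss micro-benchmark in the spirit of the dtlbmiss.c entry above might look like (this is not the actual repository file): one integer is touched per page, so roughly arraysize/pagesize compulsory DTLB misses are expected, and the PAPI_TLB_DM preset is reported.

#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main(int argc, char **argv)
{
    long pagesize, arraysize, stride, i, expected;
    int *array;
    int event[1] = { PAPI_TLB_DM };      /* data TLB misses */
    long_long count[1];
    volatile int sink = 0;

    if (argc != 3) {
        fprintf(stderr, "usage: %s pagesize_bytes arraysize_bytes\n", argv[0]);
        return 1;
    }
    pagesize  = atol(argv[1]);
    arraysize = atol(argv[2]);
    stride    = pagesize / sizeof(int);  /* one access per page */
    expected  = arraysize / pagesize;

    array = malloc(arraysize);
    if (array == NULL)
        return 1;

    if (PAPI_start_counters(event, 1) != PAPI_OK)
        return 1;

    /* Traverse the array once at a stride of one page. */
    for (i = 0; i < arraysize / (long)sizeof(int); i += stride)
        sink += array[i];

    PAPI_stop_counters(count, 1);

    printf("expected misses : %ld\n", expected);
    printf("reported misses : %lld\n", (long long)count[0]);

    free(array);
    return 0;
}

The dtlbmiss.pl script described above would then sweep the arraysize parameter and run the benchmark 100 times per size to obtain the averages and standard deviations reported in the case object that follows.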
PCAT RIB/OKC: Example Case Object

Name: DTLB misses on Itanium
Date: 12/2002
Compiler and options: gcc ver 2.96 20000731 (Red Hat Linux 7.1 2.96-101), -O0
PAPI event: PAPI_TLB_DM (data TLB misses)
Native event: DTLB_MISSES
Experimental methodology: ran the benchmark 100 times with the Perl script; averages and standard deviations reported
Input parameters used: page size = 16K; array size = 16K to 160M (incremented by multiples of 10)
Platform used: HP01.cs.utk.edu (Itanium)
Developed by: PCAT
Benchmark used: DTLB misses


PCAT RIB/OKC: Example Case Object (continued)

Results summary: The reported counts closely match the predicted counts, showing differences close to 0% even in the cases with a small number of data references, which may be more susceptible to external perturbation. The counts indicate that prefetching is not performed at the DTLB level.

Included files and description:
• dtlbmiss.itanium.c: source code of the benchmark, instrumented with PAPI to count PAPI_TLB_DM
• dtlbmiss.itanium.pl: Perl script used to run the benchmark
• dtlbmiss.itanium.txt: raw data; each column contains the results for a particular array size, and each case was run 100 times (i.e., 100 rows)
• dtlbmiss.itanium.xls: raw data, averages of runs, standard deviations, and a graph of the percentage difference between reported and predicted counts
• dtlbmiss.itanium.pdf: same content as dtlbmiss.itanium.xls


Contributions

• An infrastructure that facilitates user access to hardware performance data that is highly relevant for analyzing and improving the performance of applications on HPC platforms.
• Information that allows users to use the collected data effectively and with confidence.


QUESTIONS?