Preparatory Research on
Performance Tools for HPC
HCS Research Laboratory
University of Florida
November 21, 2003
Overview
  Background
  Evaluation Criteria
    Ideal tool
    Quantitative categories
  Performance Tools
    Tool descriptions
    Potential use and modifications
  Conclusions and Future Plans

Background
  High-performance computing
    Drastic increase in complexity
    Performance tools cannot keep pace
    Tools are used to analyze performance, identify bottlenecks, and enable source-code and compilation optimizations
  Performance tools for HPC
    Performance tools are often afterthoughts
    Performance tools do not sell systems and are therefore considered less important by vendors

Background
  UPC (Unified Parallel C)
    Roots = Split-C, AC, Parallel C Preprocessor
      Takes best aspects of each
    Parallel extension of ANSI C standard
      Allows parallel programming in familiar C style
    Abstracts communication between threads
      Shared-memory programming model offers many advantages
      Programs less complex than with MPI and others, software more scalable
      Allows shared data structures between threads
    Targeted for both shared- and distributed-memory architectures
      Challenge: achieving optimal performance
    Implementations for a growing variety of systems
      CC-NUMA, SMP, clusters

Background
  SHMEM
    Single-ended, asynchronous communication library
      Remote write and read support without involvement or notification of remote CPU
      Direct memory-to-memory copy
      Uses explicit function calls (i.e., get and put)
    “Virtual” shared memory
    Only supported on Silicon Graphics and Cray systems
      HCS lab evaluating options for clusters (e.g. via SCI, QsNet)
    Low-level functions and subroutines efficiently use hardware circuitry for low-overhead communication

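The one-sided get/put semantics described on this slide can be illustrated with a toy, single-process model. This is only a sketch under stated assumptions: the class `SymmetricHeap`, its method names, and the `halo` buffer are hypothetical and are not the SHMEM API. The point it shows is that the initiating side reads or writes the target's memory directly, and no code runs on the target.

```python
# Toy, single-process model of one-sided put/get. NOT the SHMEM API;
# all names here are hypothetical and chosen for illustration only.
class SymmetricHeap:
    """Every simulated processing element (PE) holds the same named buffers."""

    def __init__(self, n_pes, layout):
        # layout maps buffer name -> length; each PE gets its own copy.
        self.mem = [{name: [0] * size for name, size in layout.items()}
                    for _ in range(n_pes)]

    def put(self, target_pe, name, offset, values):
        # Write straight into the target PE's buffer; no target-side code runs.
        buf = self.mem[target_pe][name]
        buf[offset:offset + len(values)] = values

    def get(self, source_pe, name, offset, count):
        # Read straight out of the source PE's buffer.
        return list(self.mem[source_pe][name][offset:offset + count])


heap = SymmetricHeap(n_pes=4, layout={"halo": 8})
heap.put(target_pe=2, name="halo", offset=1, values=[7, 8, 9])
fetched = heap.get(source_pe=2, name="halo", offset=0, count=4)
print(fetched)  # [0, 7, 8, 9]
```

A real implementation would move the data over the interconnect and would need explicit synchronization before the target relies on the written values; the model above leaves both out.
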
Background
  Importance of performance tools
    Identify bottlenecks in program
      Poor mapping of program to architecture
      Unoptimized code
      Parallelization areas
      Compiler inefficiencies
    Provide insight on how code actually executes on specific architecture
  UPC and SHMEM have limited support
    TotalView (UPC): Debugging software
    CrayPat (SHMEM and UPC): Performance profiler
    Vampirtrace (SHMEM): Performance profiler

Tool Evaluation Criteria – Features
Desirable features for UPC/SHMEM performance tools:
  Profile each thread (and its variables) independently
  Show breakdown of communication with remote threads
    Frequency of the local thread's reads/writes to memory on each remote thread
    Frequency of each remote thread's reads/writes to data on the local thread
  Basic function and block performance profiling (delay per function, total functions, total time in block)
  Highlight bottlenecks, points of contention
  Break down computational stalls (blocking for I/O, shared-memory access, data dependencies, etc.)
  Communication delay on interconnect (SCI, GigE, IBA, QsNet, etc.)
  Real-time profiling
  Compiler independent
  Network independent
  Platform independent

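As a rough illustration of the remote-communication breakdown listed above, the sketch below counts remote reads and writes per pair of threads. It is a minimal sketch only: `CommProfile`, `record`, `outgoing`, and `incoming` are hypothetical names, and the event list is invented sample data rather than a trace from any real tool.

```python
from collections import Counter


class CommProfile:
    """Counts remote reads/writes per (operation, initiating thread, owning thread)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, op, src, dst):
        # op is "read" or "write"; src issues the access, dst owns the memory.
        if op not in ("read", "write"):
            raise ValueError(f"unknown operation: {op}")
        if src != dst:  # local accesses are not remote communication
            self.counts[(op, src, dst)] += 1

    def outgoing(self, thread):
        # Accesses this thread made to memory on each remote thread.
        return {(op, dst): n
                for (op, src, dst), n in self.counts.items() if src == thread}

    def incoming(self, thread):
        # Accesses each remote thread made to this thread's memory.
        return {(op, src): n
                for (op, src, dst), n in self.counts.items() if dst == thread}


profile = CommProfile()
sample_events = [("write", 0, 1), ("write", 0, 1), ("read", 0, 2),
                 ("read", 2, 0), ("read", 1, 1)]  # invented sample data
for op, src, dst in sample_events:
    profile.record(op, src, dst)

print(profile.outgoing(0))
print(profile.incoming(0))
```

In a real tool the `record` call would be driven by instrumentation of shared-memory accesses or get/put calls; how that instrumentation is inserted is outside this sketch.
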
Tool Evaluation Criteria – Example QFD Table
Feature                                   Weight
Independent thread profiling
Remote thread communication
Basic functional and block profiling
Breakdown of stalls
Real-time profiling
Communication delay on interconnect
Compiler independent
Network independent
Platform independent
Miscellaneous characteristics

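One way such a table could be used is a weighted sum per tool. The sketch below is an assumption about how scores might be combined, not the method used in this study: the weights, the 0 to 3 rating scale, and the tool names `tool_a` and `tool_b` are all hypothetical, since the slide leaves the weight column empty.

```python
def qfd_score(weights, ratings):
    """Weighted sum of per-feature ratings; unrated features count as 0."""
    return sum(w * ratings.get(feature, 0) for feature, w in weights.items())


# Hypothetical weights (the slide does not give any).
weights = {
    "Independent thread profiling": 5,
    "Remote thread communication": 5,
    "Real-time profiling": 3,
    "Platform independent": 2,
}

# Hypothetical ratings on a 0-3 scale for two made-up tools.
ratings = {
    "tool_a": {"Independent thread profiling": 3,
               "Remote thread communication": 1},
    "tool_b": {"Independent thread profiling": 2,
               "Real-time profiling": 3,
               "Platform independent": 3},
}

scores = {tool: qfd_score(weights, r) for tool, r in ratings.items()}
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranking)  # [('tool_b', 25), ('tool_a', 20)]
```
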
Tools Overview
  UPC tools
    TotalView: shows code correctness, not performance
  SHMEM tools
    CrayPat
    Vampir and Vampirtrace
  General performance tools
    Vampir and Vampirtrace
    PAPI
    Perfometer
    Kojak
    SvPablo
    Paradyn
    TAU

UPC Tools: TotalView
Overview
  Developed by Etnus, LLC
  Version 6.3 (supported UPC since version 6.1)
  Debugger only, no performance analysis
  Supports UPC, SHMEM, C/C++, Fortran, MPI, OpenMP, and others
    Supports UPC on SGI IRIX, HP Tru64
Features
  Commercial product
  Tests code modifications without recompilation
  Supports independent thread debugging
  Shared variable views
    On each thread
    Altogether (e.g. arrays)
  Other basic features
    Breakpoints
    Memory debugging
    Reliable handling of complex code
Desired Enhancements
  Performance profiling of UPC and SHMEM
    Basic statistics gathering as a starting point
    Minimize reduction in performance

SHMEM Tools: CrayPat
Overview
  Developed by Cray
  Performance analysis and tracing tool for Cray X1
  Only works for Cray systems, not currently portable
  Provides run-time analysis and profiling of program performance
  Supports Fortran, MPI, MPI2, Pthreads, SHMEM, and UPC
Features
  Provided with Cray systems
  Can provide extreme levels of detail at the cost of added complexity
  Requires rebuilding the application with CrayPat instrumentation code and libraries
  Replacement for Cray SV1 performance analysis and profiling tools
  Provides direct read access to hardware performance counters
  Allows user to aggregate, display, format, and export collected performance data in various ways
  Provides I/O performance profiling for Fortran, asynchronous, and system call routines
  Command line interface
Desired Enhancements
  Support for architectures other than Cray systems
  User-friendly GUI

Performance Tools: Vampirtrace
Overview
  Developed by Pallas
  Version 4.0
  Supports all platforms that use GNU Compiler Collection to compile C or Fortran code
  Supports Java, C, and Fortran
  Supports MPI, Global Array programming model, and SHMEM
Features
  Commercial product
  Vampirtrace = generates program trace
  Vampir = GUI used to analyze trace
  Supports multithreaded MPI programs
  Link Vampirtrace library during compilation
  Can also record arbitrary user-defined events
    Entries to and exits from subroutines
    Execution of code blocks
  Filtering mechanism to focus on user-defined events and statistics
Desired Enhancements
  Support UPC programming model

Performance Tools: PAPI
Overview
  Developed at Innovative Computing Laboratory at U. of Tennessee, Knoxville
  Version 2.3.4.2 released May 2003
  Monitors computation events using hardware counters available on modern processors
  Available for Windows, Linux, UNIX platforms
Features
  Open source, free download of full version
  Consists of two layers of software
    Portable Platform Independent Layer: API
    Platform Specific Layer: interface substrate that allows API to communicate with hardware counters via patched kernel, operating system, or directly
      Linux systems must have kernel patched with perfctr tool to allow access to hardware counters
  Provides two interfaces
    High-level interface for simple measurements and purposes
    Low-level interface for more complex and sophisticated purposes
  Many tools feature optional support for PAPI
    Additional features available when tool is configured with PAPI support
    SvPablo, Perfometer, Visual Profiler, among others
  Example metrics: L1 data cache misses, cache line invalidation, floating-point stalls, instructions per second
Desired Enhancements
  Addition of new PAPI metrics that reflect key issues directly relating to UPC/SHMEM
[Figure: PAPI Software Interface. A trace/profiling tool calls the Portable Platform Independent Layer (High-Level API, Low-Level API); the Platform Specific Layer (Machine Dependent Substrate) reaches the Hardware Performance Counters via a patched kernel, the operating system, or directly.]

Performance Tools: Perfometer
Overview
  Developed at U. of Tennessee, Knoxville
  Version 1.1 released September 12, 2002
  Works with any system with PAPI support
  Requires Java for GUI
  Provides run-time visualization of program performance
  Supports C/C++ programs and has MPI support
Features
  Open source, free download of full version
  Monitors both local and remote applications
    GUI and backend communicate through ports
    Returns information on processor and executables for each application
  Has alarms that pause program when data monitoring thresholds are reached
  Able to pause and continue program execution
  Requires perfometer() call inserted in program to enable monitoring
  Mark_perfometer() call allows user to change color of graph to see trends of different sections of code
Desired Enhancements
  Support UPC and SHMEM programming models

Performance Tools: Kojak
Overview
  Collaborative research project of U. of Tennessee, Knoxville and Research Centre Juelich (Germany)
  Version 0.99 released Nov. 4, 2003 (3rd release)
  Available for Linux IA-32, IBM Power3/Power4, SGI MIPS, IA-64, Sun SPARC
  Supports MPI 1.2 and OpenMP, as well as uniprocessor applications
Features
  Open source, free download of full version
  No modifications to source code needed
    OPARI tool (also part of TAU) provides automatic instrumentation
    Custom modifications can also be made to permit closer examination of arbitrary function calls
  EPILOG trace file generated at program run-time
    Open trace file format
    Support for conversion to VAMPIR format for analysis with VAMPIR tools
  EXPERT module provides automatic analysis of EPILOG trace files
    Tool is geared towards identifying performance problems
    Range of problems known to EXPERT is flexible and extendable
Desired Enhancements
  Support UPC and SHMEM programming models
[Figure: EXPERT Pre-Defined Monitored Properties]

Performance Tools: SvPablo
Overview
  Developed at U. of Illinois, Urbana-Champaign
  Version 5.2 released March 2003
  Tool to help developers “tune” their software for better performance and help them eliminate bottlenecks
  Available for Sun Solaris, SGI IRIX, IBM SP, Compaq Alpha, and Linux
  Supports C, Fortran 77/90, HPF, MPI, and OpenMP
Features
  Open source, free download of full version
  Interactive instrumentation of code via GUI
  Link SvPablo library during compilation
  Provides performance data
    Traces loops and function calls
    But does not trace all instructions
  Provides statistical data
    Counts how many times a function was executed
    Records execution time of function
  Correlates performance data with source code
  PAPI support
Desired Enhancements
  Support UPC and SHMEM programming models

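The per-function statistics listed above (call counts and execution time) can be sketched with a small wrapper. This only illustrates the general idea of function-level instrumentation and is not how SvPablo is implemented; the names `instrument`, `stats`, and `smooth` are hypothetical.

```python
import functools
import time

stats = {}  # function name -> {"calls": int, "seconds": float}


def instrument(func):
    """Wrap func so that each call updates its call count and elapsed time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        entry = stats.setdefault(func.__name__, {"calls": 0, "seconds": 0.0})
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs even if func raises, so failed calls are still counted.
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper


@instrument
def smooth(values):
    # Hypothetical workload: average each pair of neighbouring values.
    return [(a + b) / 2 for a, b in zip(values, values[1:])]


for _ in range(3):
    result = smooth([1.0, 3.0, 5.0])

print(result, stats["smooth"]["calls"])
```

Timing wrappers like this add overhead to every call, which is one reason the slide notes that loops and function calls are traced but not every instruction.
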
Performance Tools: Paradyn
Overview
  Developed by U. of Wisconsin, Madison
  Version 4.0 released May 31, 2003
  Visuals include time-plots, bar graphs, and tables
  Available for Solaris (SPARC), Linux (x86), Windows NT and 2000 (x86), and AIX (RS6000)
  Supports Fortran, C/C++, Java, and MPI
Features
  Open source, free download of full version
  No modifications to source or binary
  Dynamic instrumentation with real-time reporting
  Can focus on specific portions of a program and on specific performance parameters
  Records many different performance statistics such as CPU time, send/receive message count and sizes, sync time, and I/O time
  Performance Consultant executes automated performance bottleneck search
    Hypothesizes main bottleneck of program or chunks of program
    Bottlenecks classified as CPUbound, ExcessiveSyncWaitingTime, ExcessiveIOBlockingTime, TooManySmallIOOps
Desired Enhancements
  Support UPC and SHMEM programming models

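A threshold-based classifier gives a feel for how measurements could map onto the four bottleneck categories named above. This is a simplified sketch and not Paradyn's actual search algorithm: the function `hypothesize`, the threshold values, and the small-I/O cutoff are all hypothetical.

```python
# Hypothetical thresholds: fraction of total time above which a hypothesis holds.
THRESHOLDS = {
    "CPUbound": 0.6,
    "ExcessiveSyncWaitingTime": 0.2,
    "ExcessiveIOBlockingTime": 0.2,
}
SMALL_IO_BYTES = 4096  # hypothetical cutoff for a "small" average I/O operation


def hypothesize(cpu_s, sync_s, io_s, io_ops, io_bytes):
    """Return the bottleneck categories supported by the given measurements."""
    total = cpu_s + sync_s + io_s
    if total <= 0:
        return []
    fractions = {
        "CPUbound": cpu_s / total,
        "ExcessiveSyncWaitingTime": sync_s / total,
        "ExcessiveIOBlockingTime": io_s / total,
    }
    found = [name for name, frac in fractions.items()
             if frac >= THRESHOLDS[name]]
    # Many operations that each move little data suggest too many small I/O ops.
    if io_ops > 0 and io_bytes / io_ops < SMALL_IO_BYTES:
        found.append("TooManySmallIOOps")
    return found


report = hypothesize(cpu_s=2.0, sync_s=5.0, io_s=3.0,
                     io_ops=1000, io_bytes=200_000)
print(report)
```

With the sample numbers, synchronization takes half the measured time, I/O takes 30%, and the average I/O operation moves 200 bytes, so three of the four hypotheses are reported.
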
Performance Tools: TAU
Overview
  Developed at U. of Oregon
  TAU = Tuning and Analysis Utilities, a portable profiling package
  Available for SGI Origin 2K, IBM SP2, Cray T3E, Sun, Windows 95/98/NT, Linux (x86)
  Supports C/C++, Java, Fortran 77/90, HPF, HPC++, and MPI
Features
  Open source, free download of full version
  Maintains performance data for each thread, context, and node used in parallel, multi-threaded programs
  Captures data for functions, basic blocks
  Three methods of instrumentation
    1. Automatic via TAU Program Database Toolkit
    2. Manually via TAU instrumentation API
    3. Automatic at run-time via tau_run instrumentor
      DyninstAPI dynamic instrumentation package
  Racy = GUI analyzer used to find bottlenecks
  PAPI support
  Fast, reliable support
Desired Enhancements
  Support UPC and SHMEM programming models

Conclusions and Future Plans
  Few tools support UPC or SHMEM programming models
    TotalView does not analyze performance
    CrayPat is not a portable tool
  Many performance analysis tools exist for message-passing programs
  We must bridge this gap
    Bring performance analysis to UPC and SHMEM tools
    Bring UPC and SHMEM support to performance tools
    Determine most feasible approach and pursue it
    Focus on key issues at multiple levels: language, mapping, architecture
  Projected milestones/deliverables in proposed two-year project
    Year 1
      Comprehensive survey and evaluation of HPC performance tools
      Investigation of key performance attributes in UPC and SHMEM
      Investigation of key performance attributes in existing/emerging system architectures
      Refinement of evaluation criteria and QFD table to identify primary approach
    Year 2
      Development of prototype performance tools for HPC
      Performance benchmarking and optimization on selected system architectures
      Investigation of usability and productivity achieved with these tools