Performance Instrumentation and Measurement for Terascale Systems

Jack Dongarra, Shirley Moore, Philip Mucci
University of Tennessee
Sameer Shende, and Allen Malony
University of Oregon
June 2, 2003
ICCS 2003
Requirements for Terascale Systems
• Performance framework must support a wide
range of
– Performance problems (e.g., single-node performance,
synchronization and communication overhead, load
balancing)
– Performance evaluation methods (e.g., parameter-based
modeling, bottleneck detection and diagnosis)
– Programming environments (e.g., multiprocess and/or
multithreaded, parallel and distributed, large-scale)
• Need for flexible and extensible performance
observation framework
Research Problems
• Appropriate level and location for
implementing instrumentation and
measurement
• How to make the framework modular and
extensible
• Appropriate compromise between level of
detail/accuracy and instrumentation cost
Instrumentation Strategies
• Source code instrumentation
– Manual or using preprocessor
• Library level instrumentation
– e.g., MPI and OpenMP profiling interfaces
• Binary rewriting
– e.g., Pixie, ATOM, EEL, PAT
• Dynamic instrumentation
– DyninstAPI
Types of Measurements
• Profiling
• Tracing
• Real-time Analysis
Profiling
• Recording of summary information during
execution
– inclusive and exclusive time, number of calls, hardware statistics, …
• Reflects performance behavior of program entities
– functions, loops, basic blocks
– user-defined “semantic” entities
• Very good for low-cost performance assessment
• Helps to expose performance bottlenecks and
hotspots
• Implemented through
– sampling: periodic OS interrupts or hardware counter
traps
– instrumentation: direct insertion of measurement code
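As a minimal sketch of the instrumentation approach (in Python for brevity, not the compiled settings these slides target), measurement code inserted around a function can accumulate call counts and inclusive time:

```python
import time
from collections import defaultdict

# Accumulated profile: function name -> [call count, total inclusive seconds]
profile = defaultdict(lambda: [0, 0.0])

def instrument(func):
    """Directly inserted measurement code: count calls, time each one."""
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            rec = profile[func.__name__]
            rec[0] += 1
            rec[1] += time.perf_counter() - t0
    return wrapper

@instrument
def work(n):
    return sum(i * i for i in range(n))

for _ in range(3):
    work(10_000)

print(profile["work"][0])  # 3 calls recorded
```

Sampling-based profilers reach the same summary data with lower overhead by interrupting the program periodically instead of wrapping every call.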
Tracing
• Recording of information about significant points
(events) during program execution
– entering/exiting code region (function, loop, block, …)
– thread/process interactions (e.g., send/receive message)
• Save information in event record
– timestamp
– CPU identifier, thread identifier
– event type and event-specific information
• Event trace is a time-sequenced stream of event records
• Can be used to reconstruct dynamic program behavior
• Typically requires code instrumentation
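A minimal sketch of event records and trace reconstruction (in Python, with a single simulated thread and invented region names):

```python
import time
from typing import NamedTuple

class EventRecord(NamedTuple):
    timestamp: float   # when the event occurred
    thread_id: int     # CPU/thread identifier
    event_type: str    # "enter" or "exit"
    region: str        # code region name

trace = []  # time-sequenced stream of event records

def emit(event_type, region, thread_id=0):
    """Instrumentation point: append one event record to the trace."""
    trace.append(EventRecord(time.perf_counter(), thread_id, event_type, region))

# Instrumented entry/exit of two nested regions
emit("enter", "main")
emit("enter", "solver")
emit("exit", "solver")
emit("exit", "main")

# Reconstruct the dynamic call nesting from the event stream
nesting, depth = [], 0
for ev in trace:
    if ev.event_type == "enter":
        nesting.append("  " * depth + ev.region)
        depth += 1
    else:
        depth -= 1
print(nesting)  # ['main', '  solver']
```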
Real-time Analysis
• Allows evaluation of program performance
during execution
• Examples
– Paradyn
– Autopilot
– Perfometer
TAU Performance System Architecture
[Architecture diagram: TAU performance system, including trace export to Paraver and EPILOG]
TAU Instrumentation
• Manually using TAU instrumentation API
• Automatically using
– Program Database Toolkit (PDT)
– MPI profiling library
– Opari OpenMP rewriting tool
• Uses PAPI to access hardware counter data
Program Database Toolkit (PDT)
• Program code analysis framework for developing
source-based tools
• High-level interface to source code information
• Integrated toolkit for source code parsing,
database creation, and database query
– commercial grade front end parsers
– portable IL analyzer, database format, and access API
– open software approach for tool development
• Targets and integrates multiple source languages
• Used in TAU to build automated performance
instrumentation tools
PDT Components
• Language front end
– Edison Design Group (EDG): C, C++
– Mutek Solutions Ltd.: F77, F90
– creates an intermediate-language (IL) tree
• IL Analyzer
– processes the intermediate language (IL) tree
– creates “program database” (PDB) formatted file
• DUCTAPE (Bernd Mohr, ZAM, Germany)
– C++ program Database Utilities and Conversion Tools
APplication Environment
– processes and merges PDB files
– C++ library to access the PDB for PDT applications
TAU Analysis
• Profile analysis
– pprof
• parallel profiler with text-based display
– Racy / jRacy
• graphical interface to pprof (Tcl/Tk)
• jRacy is a Java implementation of Racy
– ParaProf
• Next-generation parallel profile analysis and display
• Trace analysis and visualization
– Trace merging and clock adjustment (if necessary)
– Trace format conversion (ALOG, SDDF, Vampir)
– Vampir (Pallas) trace visualization
– Paraver (CEPBA) trace visualization
TAU Pprof Display
jRacy (NAS Parallel Benchmark – LU)
[Screenshots: global profiles, a routine profile across all nodes, and an individual profile; labels n: node, c: context, t: thread]
ParaProf Scalable Profiler
• Re-implementation of jRacy tool
• Target flexibility in profile input source
– Profile files, performance database, online
• Target scalability in profile size and display
– Will include three-dimensional display support
• Provide more robust analysis and extension
– Derived performance statistics
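Derived statistics combine raw measurements, for instance MFLOP/s and instructions per cycle from counter totals and exclusive time. The counter values below are made-up illustrative numbers, not output from any real run:

```python
# Derived performance statistics from raw profile measurements.
# All counter values below are invented for illustration.
raw = {
    "PAPI_FP_OPS": 4.0e9,   # floating-point operations
    "PAPI_TOT_CYC": 8.0e9,  # total cycles
    "PAPI_TOT_INS": 6.0e9,  # total instructions
    "time_sec": 2.0,        # exclusive time in seconds
}

mflops = raw["PAPI_FP_OPS"] / raw["time_sec"] / 1e6  # millions of flop/s
ipc = raw["PAPI_TOT_INS"] / raw["PAPI_TOT_CYC"]      # instructions per cycle
print(mflops, ipc)  # 2000.0 0.75
```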
ParaProf Architecture
512-Processor Profile (SAMRAI)
Three-dimensional Profile Displays
500-processor Uintah execution (University of Utah)
Overview of PAPI
• Performance Application Programming Interface
• The purpose of the PAPI project is to design, standardize
and implement a portable and efficient API to access the
hardware performance monitor counters found on most
modern microprocessors.
• Parallel Tools Consortium project
• Reference implementations for all major HPC platforms
• Installed and in use at major government labs and
academic sites
• Becoming de facto industry standard
• Incorporated into many performance analysis tools – e.g.,
HPCView, SvPablo, TAU, Vampir, Vprof
PAPI Counter Interfaces
• PAPI provides three interfaces to the
underlying counter hardware:
1. The low level interface provides functions for
setting options, accessing native events, callback
on counter overflow, etc.
2. The high level interface simply provides the
ability to start, stop and read the counters for a
specified list of events.
3. Graphical tools to visualize information.
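The high-level start/stop/read pattern can be sketched as a Python mock (the real PAPI API is a C library; the counter "hardware" here is simulated, so the values are not real measurements):

```python
# Mock of the high-level start/stop/read counter pattern.
class MockCounters:
    def __init__(self):
        self._running = False
        self._values = {}

    def start(self, events):
        """Begin counting the named events from zero."""
        self._values = {e: 0 for e in events}
        self._running = True

    def read(self):
        """Read current values without stopping (simulated counts)."""
        for e in self._values:
            self._values[e] += 1000  # stand-in for real hardware counts
        return dict(self._values)

    def stop(self):
        """Stop counting and return the final values."""
        self._running = False
        return dict(self._values)

c = MockCounters()
c.start(["PAPI_TOT_CYC", "PAPI_FP_OPS"])  # PAPI preset event names
mid = c.read()
final = c.stop()
print(final)  # {'PAPI_TOT_CYC': 1000, 'PAPI_FP_OPS': 1000}
```

The low-level interface adds per-event-set options, native event access, and overflow callbacks on top of this same lifecycle.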
PAPI Implementation
[Layer diagram]
• Portable layer: tools, PAPI high-level interface, PAPI low-level interface
• Machine-specific layer: PAPI machine-dependent substrate, kernel extension, operating system, hardware performance counters
PAPI Preset Events
• Proposed standard set of events deemed
most relevant for application performance
tuning
• Defined in papiStdEventDefs.h
• Mapped to native events on a given
platform
– Run tests/avail to see list of PAPI preset events
available on a platform
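Conceptually, the preset table maps each portable name to whatever native event the platform offers, or to nothing if the event is unsupported. A sketch with invented native event names (the PAPI preset names are real; the mapping is hypothetical):

```python
# Conceptual preset-to-native mapping; native names are made up.
PRESET_MAP = {
    "PAPI_TOT_CYC": "cpu_cycles",
    "PAPI_FP_OPS": "fp_ops_retired",
    "PAPI_L1_DCM": None,  # not available on this hypothetical platform
}

def avail():
    """Like tests/avail: list which presets this platform supports."""
    return sorted(p for p, native in PRESET_MAP.items() if native is not None)

print(avail())  # ['PAPI_FP_OPS', 'PAPI_TOT_CYC']
```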
Scalability of PAPI Instrumentation
• Overhead of library calls to read counters can be
excessive.
• Statistical sampling can reduce overhead.
• PAPI substrate for Alpha Tru64 UNIX
– Built on top of DADD/DCPI (Dynamic Access to DCPI
Data/Digital Continuous Profiling Interface)
– Sampling approach supported in hardware
– 1-2% overhead compared to 30% on other platforms
• Using sampling and hardware profiling support on
Itanium/Itanium2
Vampir v3.x: Hardware Counter Data
• Counter Timeline Display
What is DynaProf?
• A portable tool to instrument a running
executable with Probes that monitor
application performance.
• Simple command line interface.
• Open Source Software
• A work in progress…
DynaProf Methodology
• Make collection of run-time performance
data easy by:
– Avoiding source instrumentation and recompilation
– Using the same tool with different probes
– Providing useful and meaningful probe data
– Providing different kinds of probes
– Allowing custom probes
Why the “Dyna”?
• Instrumentation is selectively inserted
directly into the program’s address space.
• Why is this a better way?
– No perturbation of compiler optimizations
– Complete language independence
– Multiple Insert/Remove instrumentation cycles
DynaProf Design
• GUI, command line & script driven user
interface
• Uses GNU readline for command line
editing and command completion.
• Instrumentation is done using:
– Dyninst on Linux, Solaris and IRIX
– DPCL on AIX
DynaProf Commands
load <executable>
list [module pattern]
use <probe> [probe args]
instr module <module> [probe args]
instr function <module> <function> [probe args]
stop
continue
run [args]
info
unload
DynaProf Probe Design
• Probes provided with distribution
– Wallclock probe
– PAPI probe
– Perfometer probe
• Can be written in any compiled language
• Probes export 3 functions with a standardized
interface.
• Easy to roll your own (< 1 day)
• Supports separate probes for
MPI/OpenMP/Pthreads
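Real DynaProf probes are compiled shared libraries, but the three-entry-point contract can be illustrated in Python; the function names `probe_init`/`probe_enter`/`probe_exit` below are hypothetical stand-ins, not the actual exported symbols:

```python
import time

# Sketch of a wallclock probe exporting a standardized
# three-function interface (names are hypothetical).
class WallclockProbe:
    def __init__(self):
        self.totals = {}    # region name -> accumulated seconds
        self._starts = {}   # region name -> entry timestamp

    def probe_init(self, args=""):
        """Called once when the probe is loaded into the target."""
        self.totals.clear()

    def probe_enter(self, name):
        """Called at each instrumented entry point."""
        self._starts[name] = time.perf_counter()

    def probe_exit(self, name):
        """Called at each instrumented exit point."""
        elapsed = time.perf_counter() - self._starts.pop(name)
        self.totals[name] = self.totals.get(name, 0.0) + elapsed

p = WallclockProbe()
p.probe_init()
p.probe_enter("solver")
p.probe_exit("solver")
print(sorted(p.totals))  # ['solver']
```

A PAPI probe would follow the same shape, reading counters instead of the clock at entry and exit.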
Future development
• GUI development
• Additional probes
– Perfex probe
– Vprof probe
– TAU probe
• Better support for parallel applications
Perfometer
• Application is instrumented with PAPI
– call perfometer()
– call mark_perfometer(int color, char *label)
• Application is started. At the call to perfometer, a signal
handler and a timer are set up to collect and send the
information to a Java applet containing the graphical view.
• Sections of code that are of interest can be designated with
specific colors
• Real-time display or trace file
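The timer-plus-signal-handler collection loop can be sketched in Python (a stand-in only: the real probe reads PAPI counters and streams rates to the Java applet, whereas this handler merely records timestamps):

```python
import signal
import time

samples = []  # in Perfometer, these would be counter rates sent to the display

def handler(signum, frame):
    # Stand-in for reading hardware counters and shipping the rate out.
    samples.append(time.perf_counter())

signal.signal(signal.SIGALRM, handler)
signal.setitimer(signal.ITIMER_REAL, 0.01, 0.01)  # fire every 10 ms

deadline = time.perf_counter() + 0.1
while time.perf_counter() < deadline:
    pass  # application work happens here

signal.setitimer(signal.ITIMER_REAL, 0, 0)  # cancel the timer
print(len(samples) > 0)
```

This requires a Unix-like OS for `SIGALRM`/`setitimer`; the same structure underlies most periodic real-time collectors.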
Perfometer Display
[Screenshot: machine info, flop/s rate, flop/s min/max, process and real time]
Perfometer Parallel Interface
Conclusions
• TAU and PAPI projects are addressing important
research problems involved in constructing a
flexible and extensible performance observation
framework.
• Widespread adoption of PAPI demonstrates the
value of a portable interface to low-level
architecture-specific performance monitoring
hardware.
• TAU framework provides flexible mechanisms for
instrumentation and measurement.
Conclusions (cont.)
• Terascale systems require scalable low-overhead
means of collecting performance data.
– Statistical sampling support in PAPI
– TAU filtering and feedback schemes for focusing
instrumentation
– Real-time monitoring capabilities (DynaProf,
Perfometer)
• PAPI and TAU infrastructure is designed for
interoperability, flexibility, and extensibility.
More Information
• TAU (http://www.acl.lanl.gov/tau)
• PDT (http://www.acl.lanl.gov/pdtoolkit)
• PAPI (http://icl.cs.utk.edu/papi/)
• OPARI (http://www.fz-juelich.de/zam)