Profiling Tools In Ranger
Carlos Rosales, Kent Milfeld and Yaakoub Y. El Khamra
carlos@tacc.utexas.edu
mpiP and IPM
SCALABILITY IN MPI APPLICATIONS
About mpiP
mpiP is an MPI Profiling library
mpip.sourceforge.net
• Scalable & Lightweight
• Multiplatform
– Linux IA32/IA64/x86_64/MIPS64
– IBM POWER 4/5
– Cray XT3/XT4/X1E
• Does not require manual code instrumentation
– collects statistics of MPI functions (wraps original MPI function calls)
– less overhead than tracing tools
– less data than tracing tools
• Easy to use (requires re-linking, but no recompilation)
Using mpiP
Load the mpiP module:
% module load mpiP
Link the static library before any others:
% mpicc -g -L$TACC_MPIP_LIB -lmpiP -lbfd -liberty ./srcFile.c
Set the environment variables controlling the mpiP output:
% setenv MPIP '-t 10 -k 2'
In this case:
-t 10  only callsites using more than 10% of MPI time are included in the report
-k 2  set callsite stack traceback depth to 2
Run program through the queue as usual.
mpiP runtime options
option   description                                                     default
-c       generate concise report, no callsite process-specific detail
-k n     set callsite stack traceback depth to <n>                       1
-o       disable profiling at initialization
-t x     set print threshold, <x> is MPI % of time for each callsite     0.0
-v       generate both concise and verbose reports
-x exe   specify the full path to the executable

(csh)%  setenv MPIP 'option1 option2 …'
(bash)% export MPIP='option1 option2 …'
mpiP calls from C/Fortran
Generate arbitrary reports using the function call
MPI_Pcontrol() with different arguments
Argument   Output behavior
0          Disable profiling (default)
1          Enable profiling
2          Reset all callsite data
3          Generate verbose report
4          Generate concise report
Useful to:
– profile specific sections of the code
– obtain individual profiles of multiple function calls
MPI_Pcontrol examples
Scope limitation
Individual reports
switch(i) {
case 5:
MPI_Pcontrol(1);// enable profiling
break;
case 6:
MPI_Pcontrol(0);// disable profiling
break;
default:
break;
}
/* ... do something for one timestep ... */
switch(i) {
case 5:
MPI_Pcontrol(2); // reset profile data
MPI_Pcontrol(1); // enable profiling
break;
case 6:
MPI_Pcontrol(3); // generate verbose report
MPI_Pcontrol(4); // generate concise report
MPI_Pcontrol(0); // disable profiling
break;
default:
break;
}
/* ... do something for one timestep ... */
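For reference, here is how these calls might sit in a complete program. This is a minimal sketch (not from the original slides): it skips profiling during setup, profiles only one communication section, and asks mpiP for a verbose report of that section.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Pcontrol(0);                 /* disable profiling during setup            */
        /* ... setup work that should not appear in the report ... */

        MPI_Pcontrol(2);                 /* reset any callsite data collected so far  */
        MPI_Pcontrol(1);                 /* enable profiling for the section below    */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Pcontrol(3);                 /* generate a verbose report for this section */
        MPI_Pcontrol(0);                 /* disable profiling for the rest of the run  */

        if (rank == 0) printf("done\n");
        MPI_Finalize();
        return 0;
    }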
mpiP output
After running the executable, a file with the extension .mpiP will be generated containing:
– MPI Time (MPI time for all MPI calls)
– MPI callsites
– Aggregate message size
– Aggregate time
For scalability analysis it is important to compare the total MPI time to
the total running time of the application.
Detailed function call data can be used to identify communication
hotspots.
mpiP output: MPI Time
---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime   MPITime    MPI%
   0       4.83      4.4    91.12
   1       4.83      0.332   6.87
   2       4.83      0.31    6.42
   3       4.83      0.316   6.54
   4       4.83      0.328   6.79
   5       4.83      0.34    7.04
   6       4.83      0.33    6.84
   7       4.83      0.342   7.08
   8       4.83      0.324   6.71
   9       4.83      0.323   6.69
  10       4.83      0.346   7.17
  11       4.83      0.328   6.79
  12       4.83      0.341   7.06
  13       4.83      0.32    6.63
  14       4.83      0.345   7.15
  15       4.83      0.344   7.13
   *      77.2       9.36   12.13
mpiP output: MPI Time
In the table above, task 0 spends 91% of its wall-clock time in MPI while every other task spends roughly 7%: this process seems to be controlling all MPI exchanges.
mpiP output: MPI callsites
---------------------------------------------------------------------------
@--- Callsites: 9 ---------------------------------------------------------
---------------------------------------------------------------------------
 ID Lev File/Address   Line  Parent_Funct  MPI_Call
  1   0 matmultc.c       60  main          Send
  2   0 matmultc.c       52  main          Bcast
  3   0 matmultc.c      103  main          Barrier
  4   0 matmultc.c       78  main          Send
  5   0 matmultc.c       65  main          Recv
  6   0 matmultc.c       74  main          Send
  7   0 matmultc.c       98  main          Send
  8   0 matmultc.c       92  main          Recv
  9   0 matmultc.c       88  main          Bcast
mpiP output: Aggregate time
---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call      Site      Time   App%   MPI%    COV
Recv         5  4.19e+03   5.42  44.70   0.00
Bcast        9  4.12e+03   5.34  44.03   0.00
Recv         8       412   0.53   4.40   0.11
Barrier      3       229   0.30   2.45   0.79
Send         7       203   0.26   2.17   0.44
Bcast        2       174   0.23   1.86   0.00
Send         6        37   0.05   0.39   0.00
Send         1     0.473   0.00   0.01   0.00
Send         4     0.023   0.00   0.00   0.00
mpiP output: Message Size
---------------------------------------------------------------------------
@--- Aggregate Sent Message Size (top twenty, descending, bytes) ----------
---------------------------------------------------------------------------
Call      Site     Count      Total      Avrg   Sent%
Bcast        9     30000    4.8e+08   1.6e+04   83.33
Send         7      2000    3.2e+07   1.6e+04    5.56
Bcast        2      2000    3.2e+07   1.6e+04    5.56
Send         6      1985   3.18e+07   1.6e+04    5.51
Send         1        15    2.4e+05   1.6e+04    0.04
Send         4        15        120         8    0.00
About IPM
IPM is an Integrated Performance Monitoring tool
http://ipm-hpc.sourceforge.net/
• Portable profiling infrastructure for parallel codes
• Low-overhead performance summary of computing and communication
• IPM is a quick, easy and concise profiling tool
• Requires no manual instrumentation, just adding the -g option to the compilation
• Produces XML output that is parsed by scripts to generate browser-readable HTML pages
• The level of detail it reports is lower than TAU, PAPI, HPCToolkit or Scalasca but higher than mpiP
Using IPM
• Available on Ranger for both the Intel and PGI compilers, with MVAPICH and MVAPICH2
• Create ipm environment with module command before running
code:
% module load ipm
• In your job script, set up the following ipm environment before the
ibrun command:
module load ipm
export LD_PRELOAD=$TACC_IPM_LIB/libipm.so
export IPM_REPORT=full
ibrun <my executable> <my arguments>
Using IPM
• export LD_PRELOAD=$TACC_IPM_LIB/libipm.so
  – must be inside the job script to ensure the IPM wrappers for MPI calls are loaded properly
• IPM_REPORT: controls the level of information collected
  – full
  – terse
  – none
• IPM_MPI_THRESHOLD: reports only routines using this percentage (or more) of MPI time
  – e.g. "IPM_MPI_THRESHOLD 0.3" reports subroutines that consume more than 30% of the total MPI time
• For further details:
  % module help ipm
Output from IPM
• When your code has finished running IPM will create an XML file with a
name like:
username.1231369287.321103.0
• Get basic or full information in text mode using:
% ipm_parse username.1231369287.321103.0
% ipm_parse -full username.1231369287.321103.0
• You can also transform this XML file into standard HTML files:
% ipm_parse -html username.1231369287.321103.0
• This generates a directory which contains an index.html file readable by
any web browser
• Tar this directory and scp the file to your own local computer to visualize
the results
IPM: Text Output
##IPMv0.922###############################################
#
# command : /work/01125/yye00/ICAC/cactus_SandTank SandTank.par
# host    : i101-309/x86_64_Linux        mpi_tasks : 32 on 2 nodes
# start   : 05/26/09/11:49:06            wallclock : 2.758892 sec
# stop    : 05/26/09/11:49:09            %comm     : 2.01
# gbytes  : 4.38747e+00 total            gflop/sec : 9.39108e-02 total
#
##########################################################
# region  : *    [ntasks] = 32
#
#                     [total]        <avg>          min          max
# entries                  32            1            1            1
# wallclock           88.2742      2.75857      2.75816      2.75889
# user                5.51634     0.172386     0.148009     0.200012
# system                1.771    0.0553438    0.0536683     0.056717
# %comm                            2.00602      1.94539      2.05615
# gflop/sec         0.0939108   0.00293471   0.00293338     0.002952
# gbytes              4.38747     0.137109     0.136581     0.144985
#
# PAPI_FP_OPS      2.5909e+08  8.09655e+06  8.09289e+06  8.14685e+06
# PAPI_TOT_CYC    6.80291e+09  2.12591e+08  2.02236e+08  2.19109e+08
# PAPI_VEC_INS    5.95596e+08  1.86124e+07  1.85964e+07   1.8756e+07
# PAPI_TOT_INS    4.16377e+09  1.30118e+08   1.0987e+08  1.35676e+08
#
#                      [time]      [calls]       <%mpi>      <%wall>
# MPI_Allreduce      0.978938        53248        55.28         1.11
# MPI_Comm_rank      0.316355         6002        17.86         0.36
# MPI_Barrier        0.247135         3680        13.95         0.28
# MPI_Allgatherv      0.16621         2848         9.39         0.19
# MPI_Bcast         0.0217298          576         1.23         0.02
# MPI_Allgather     0.0216982          672         1.23         0.02
# MPI_Recv          0.0186796           32         1.05         0.02
# MPI_Comm_size   0.000139921         2112         0.01         0.00
# MPI_Send        0.000115622           32         0.01         0.00
###########################################################
IPM: HTML Output
(Screenshots of the HTML report: load balance, communication balancing by task, connectivity, buffer-size distribution, and memory usage views.)
timers, gprof
BASIC PROFILING TOOLS
Timers: Command Line
• The command time is available in most Unix systems.
• It is simple to use (no code instrumentation required).
• Gives the total execution time of a process and all its children in seconds.
% /usr/bin/time -p ./exeFile
real 9.95
user 9.86
sys 0.06
Leave out the -p option to get additional information:
% time ./exeFile
% 9.860u 0.060s 0:09.95 99.9% 0+0k 0+0io 0pf+0w
Timers: Code Section
Fortran:
INTEGER :: rate, start, stop
REAL    :: time

CALL SYSTEM_CLOCK(COUNT_RATE = rate)    ! clock ticks per second
CALL SYSTEM_CLOCK(COUNT = start)
! Code to time here
CALL SYSTEM_CLOCK(COUNT = stop)
time = REAL(stop - start) / REAL(rate)  ! convert ticks to seconds

C:
#include <time.h>

double start, stop, time;

start = (double)clock() / CLOCKS_PER_SEC;  /* CPU time used so far, in seconds */
/* Code to time here */
stop = (double)clock() / CLOCKS_PER_SEC;
time = stop - start;                       /* CPU seconds spent in the section */
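Note that clock() measures CPU time, not elapsed wall-clock time. When wall-clock timing of a code section is needed in C, one common alternative is gettimeofday() from <sys/time.h>; the sketch below is an illustration and is not part of the original slides.

    #include <stdio.h>
    #include <sys/time.h>

    /* Current wall-clock time in seconds, with microsecond resolution. */
    static double wall_seconds(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1.0e-6;
    }

    int main(void)
    {
        double start = wall_seconds();
        /* Code to time here */
        double elapsed = wall_seconds() - start;
        printf("elapsed: %f s\n", elapsed);
        return 0;
    }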
About GPROF
GPROF is the GNU Project PROFiler.
gnu.org/software/binutils/
• Requires recompilation of the code.
• Compiler options and libraries provide wrappers for each routine call and periodic sampling of the program.
• A default gmon.out file is produced with the function call information.
• GPROF links the symbol list in the executable with the data in gmon.out.
Types of Profiles
• Flat Profile
  – CPU time spent in each function (self and cumulative)
  – Number of times a function is called
  – Useful to identify the most expensive routines
• Call Graph
  – Number of times a function was called by other functions
  – Number of times a function called other functions
  – Useful to identify function relations
  – Suggests places where function calls could be eliminated
• Annotated Source
– Indicates number of times a line was executed
Profiling with gprof
Use the -pg flag during compilation:
% gcc -g -pg ./srcFile.c
% icc -g -p ./srcFile.c
% pgcc -g -pg ./srcFile.c
Run the executable. An output file gmon.out will be generated with the profiling
information.
Execute gprof and redirect the output to a file:
% gprof ./exeFile gmon.out > profile.txt
% gprof -l ./exeFile gmon.out > profile_line.txt
% gprof -A ./exeFile gmon.out > profile_annotated.txt
Flat profile
In the flat profile we can identify the most expensive parts of the code (in this case, the calls to
matSqrt, matCube, and sysCube).
  %   cumulative     self               self     total
 time    seconds   seconds    calls   s/call    s/call   name
50.00       2.47      2.47        2     1.24      1.24   matSqrt
24.70       3.69      1.22        1     1.22      1.22   matCube
24.70       4.91      1.22        1     1.22      1.22   sysCube
 0.61       4.94      0.03        1     0.03      4.94   main
 0.00       4.94      0.00        2     0.00      0.00   vecSqrt
 0.00       4.94      0.00        1     0.00      1.24   sysSqrt
 0.00       4.94      0.00        1     0.00      0.00   vecCube
Call Graph Profile
index % time    self  children    called     name
                0.00      0.00       1/1         <hicore> (8)
[1]    100.0    0.03      4.91       1       main [1]
                0.00      1.24       1/1         sysSqrt [3]
                1.24      0.00       1/2         matSqrt [2]
                1.22      0.00       1/1         sysCube [5]
                1.22      0.00       1/1         matCube [4]
                0.00      0.00       1/2         vecSqrt [6]
                0.00      0.00       1/1         vecCube [7]
-----------------------------------------------
                1.24      0.00       1/2       main [1]
                1.24      0.00       1/2       sysSqrt [3]
[2]     50.0    2.47      0.00       2       matSqrt [2]
-----------------------------------------------
                0.00      1.24       1/1       main [1]
[3]     25.0    0.00      1.24       1       sysSqrt [3]
                1.24      0.00       1/2         matSqrt [2]
                0.00      0.00       1/2         vecSqrt [6]
-----------------------------------------------
Visual Call Graph
(Call graph diagram: main calls sysSqrt, matSqrt, sysCube, matCube, vecSqrt and vecCube; sysSqrt calls matSqrt and vecSqrt.)
PerfExpert, Tau
ADVANCED PROFILING TOOLS
Advanced Profiling Tools
• Can be intimidating:
– Difficult to install
– Many dependencies
– Require kernel patches
Not your problem!!
• Useful for serial and parallel programs
• Extensive profiling and scalability information
• Analyze code using:
– Timers
– Hardware registers (PAPI)
– Function wrappers
PAPI
PAPI is a Performance Application Programming Interface
icl.cs.utk.edu/papi
• API to use hardware counters
• Behind Tau, HPCToolkit
• Multiplatform:
  – Most Intel & AMD chips
  – IBM POWER 4/5/6
  – Cray X/XD/XT
  – Sun UltraSparc I/II/III
  – MIPS
  – SiCortex
  – Cell
• Available as a module in Ranger
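As an illustration of how counters can be read directly from a program (this sketch is not part of the original slides; it uses the classic PAPI high-level calls PAPI_start_counters/PAPI_stop_counters with two events from the table on the next slide, and the exact compile and link flags depend on the installation, see module help papi):

    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int events[2] = { PAPI_TOT_CYC, PAPI_FP_OPS };   /* total cycles, floating point operations */
        long long values[2];
        double a = 0.0;
        int i;

        /* Start counting the two hardware events on this thread. */
        if (PAPI_start_counters(events, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_start_counters failed\n");
            return 1;
        }

        for (i = 0; i < 1000000; i++)                    /* code section to measure */
            a += i * 0.5;

        /* Stop counting and read the accumulated values. */
        if (PAPI_stop_counters(values, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_stop_counters failed\n");
            return 1;
        }

        printf("cycles = %lld  fp ops = %lld  (a = %g)\n", values[0], values[1], a);
        return 0;
    }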
PAPI: Available Events
Counter/Event Name   Meaning
PAPI_L1_DCM          Level 1 data cache misses
PAPI_L1_ICM          Level 1 instruction cache misses
PAPI_L2_DCM          Level 2 data cache misses
PAPI_L2_ICM          Level 2 instruction cache misses
PAPI_L2_TCM          Level 2 cache misses
PAPI_L2_ICH          Level 2 instruction cache hits
PAPI_L3_TCM          Level 3 cache misses
PAPI_FPU_IDL         Cycles floating point units are idle
PAPI_TLB_DM          Data translation lookaside buffer misses
PAPI_TLB_IM          Instruction translation lookaside buffer misses
PAPI_STL_ICY         Cycles with no instruction issue
PAPI_HW_INT          Hardware interrupts
PAPI_BR_TKN          Conditional branch instructions taken
PAPI_BR_MSP          Conditional branch instructions mispredicted
PAPI_BR_INS          Branch instructions
PAPI_TOT_INS         Instructions completed
PAPI_TOT_CYC         Total cycles
PAPI_VEC_INS         Vector/SIMD instructions
PAPI_RES_STL         Cycles stalled on any resource
PAPI_L1_DCA          Level 1 data cache accesses
PAPI_L2_DCA          Level 2 data cache accesses
PAPI_L1_ICA          Level 1 instruction cache accesses
PAPI_L2_ICA          Level 2 instruction cache accesses
PAPI_L1_ICR          Level 1 instruction cache reads
PAPI_L2_TCA          Level 2 total cache accesses
PAPI_L3_TCR          Level 3 total cache reads
PAPI_FP_INS          Floating point instructions
PAPI_FP_OPS          Floating point operations
PAPI_FML_INS         Floating point multiply instructions
PAPI_FAD_INS         Floating point add instructions (also includes subtract instructions)
PAPI_FDV_INS         Floating point divide instructions (counts both divide and square root instructions)
PAPI_FSQ_INS         Floating point square root instructions (counts both divide and square root instructions)
About PerfExpert
• Brand new tool, locally developed at UT
• Easy to use and understand
• Great for quick profiling and for beginners
• Provides recommendation on “what to fix” in a subroutine
• Collects information from PAPI using HPCToolkit
• No MPI specific profiling, no 3D visualization, no elaborate
metrics
• Combines ease of use with useful interpretation of gathered
performance data
Using PerfExpert
• Load the papi and java modules:
% module load papi
% module load java
• Copy the PerfExpert.sge submission script (for editing):
% cp /share/home/00976/burtsche/PerfExpert/PerfExpert.sge ./
• Edit the PerfExpert.sge script to ensure the correct executable
name, correct directory, correct project name and so on.
• Submit your job:
% qsub PerfExpert.sge
• To analyze results:
% /share/home/00976/burtsche/PerfExpert/PerfExpert <threshold> ./hpctoolkit-….
A typical value for <threshold> is 0.1.
About Tau
TAU is a suite of Tuning and Analysis Utilities
www.cs.uoregon.edu/research/tau
• 11+ year project involving
– University of Oregon Performance Research Lab
– LANL Advanced Computing Laboratory
– Research Centre Julich at ZAM, Germany
• Integrated toolkit
– Performance instrumentation
– Measurement
– Analysis
– Visualization
Using Tau
• Load the papi and tau modules
• Gather information for the profile run:
– Type of run (profiling/tracing, hardware counters, etc…)
– Programming Paradigm (MPI/OMP)
– Compiler (Intel/PGI/GCC…)
• Select the appropriate TAU_MAKEFILE based on your choices
($TAU/Makefile.*)
• Set up the selected PAPI counters in your submission script
• Run as usual & analyze using paraprof
– You can transfer the database to your own PC to do the analysis
TAU Performance System Architecture
Tau: Example
Load the papi and tau modules:
% module load papi
% module load tau
Say that we choose to do:
– a profiling run with multiple counters
– for an MPI parallel code
– using the PDT instrumentor
– with the PGI compiler
The TAU_MAKEFILE to use for this combination is:
$TAU/Makefile.tau-multiplecounters-mpi-papi-pdt-pgi
So we set it up:
% setenv TAU_MAKEFILE $TAU/Makefile.tau-multiplecounters-mpi-papi-pdt-pgi
And we compile using the wrapper provided by tau:
% tau_cc.sh matmult.c
Tau: Example (Cont.)
Next we decide which hardware counters to use:
– GET_TIME_OF_DAY  (time profiling, similar to using gprof)
– PAPI_FP_OPS      (floating point operations)
– PAPI_L1_DCM      (Level 1 data cache misses)
We set these as environment variables in the command line or the submission script.
For csh:
% setenv COUNTER1 GET_TIME_OF_DAY
% setenv COUNTER2 PAPI_FP_OPS
% setenv COUNTER3 PAPI_L1_DCM
For bash:
% export COUNTER1=GET_TIME_OF_DAY
% export COUNTER2=PAPI_FP_OPS
% export COUNTER3=PAPI_L1_DCM
Then we send the job through the queue as usual.
Tau: Example (Cont.)
When the program finishes running one new directory will be created for each
hardware counter we specified:
–
–
–
MULTI__GET_TIME_OF_DAY
MULTI__PAPI_FP_OPS
MULTI__PAPI_L1_DCM
Analyze the results with paraprof:
% paraprof
TAU: ParaProf Manager
Provides machine details and organizes runs as Applications, Experiments and Trials.
(The screenshot shows the counters we asked for.)
Tau: Metric View
Profile information is shown in the "GET_TIME_OF_DAY" metric.
Mean and standard deviation statistics are given; the function legend is available under Windows -> Function Legend.
Tau: Metric View
Unstack the bars for clarity: Options -> Stack Bars Together
Tau: Function Data Window
Click on any of the bars corresponding to function multiply_matrices. This opens the
Function Data Window, which gives a closer look at a single function.
Tau: Floating Point OPS
Hardware Counters provide Floating Point Operations (Function Data view).
Tau: L1 Cache Data Misses
Derived Metrics
Select Argument 1 (green ball); Select Argument 2 (green ball);
Select Operation; then Apply. Derived Metric will appear as a new trial.
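For example (an illustration, not from the original slides), dividing the PAPI_FP_OPS metric by the PAPI_L1_DCM metric yields a derived metric of floating point operations per L1 data cache miss, the FP/miss ratio discussed on the following slides.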
Derived Metrics (Cont.)
• Select a Function
• Function Data Window -> Options -> Select Metric -> Exclusive -> …
Derived Metrics
Since the FP/miss ratios are constant, this must be a memory access problem.
Be careful: even though the ratios are constant, cores may do different amounts of work/operations per call.
Callpath
• To find out about function calls within the program, follow
the same process but using the following TAU_MAKEFILE:
Makefile.tau-callpath-mpi-pdt-pgi
• In the Metric View Window two new options will be
available under:
– Windows -> Thread -> Call Graph
– Windows -> Thread -> Call Path Relations
Callpath
Call Graph Paths (must be selected through the "thread" menu)
Call Path
Profiling dos and don’ts
DO
• Test every change you make
• Profile typical cases
• Compile with optimization flags
• Test for scalability

DO NOT
• Assume a change will be an improvement
• Profile atypical cases
• Profile ad infinitum
  – Set yourself a goal or a time limit
Other Profiling Tools In Ranger
• Scalasca
scalasca.org
– Scalable all-in-one profiling package.
– Requires re-compilation of source to instrument, like Tau.
– Accessible by loading the scalasca module:
– module load scalasca
• HPCToolkit
hpctoolkit.org
– All-in-one package similar to Tau and Scalasca, but more recently developed
– Uses binary instrumentation, so recompilation of the code
is not required
– Accessible via developer’s installation under:
/scratch/projects/hpctoolkit/pkgs/hpctoolkit/bin/