Spring School in Advanced Computing TACC @ UP

Profiling
Kent Milfeld
TACC
May 28, 2009
Outline
• Profiling MPI performance with mpiP
• Tau
• gprof
• Example: matmult
mpiP
• Scalable profiling library for MPI applications
• Lightweight
• Collects statistics of MPI functions
• Communicates only during report generation
• Less data than tracing tools
• http://mpip.sourceforge.net
Usage, Instrumentation, Analysis
• How to use
  – No recompiling required!
  – Profiling gathered in the MPI profiling layer
• Link the static library before the default MPI libraries (full link line shown below):
  -g -L${TACC_MPIP_LIB} -lmpiP -lbfd -liberty -lintl -lm
• What to analyze
  – Overview of time spent in MPI communication during the application run
  – Aggregate time for individual MPI calls
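For example, a complete link line might look like this (a sketch, assuming the mpicc wrapper and the matmultc example used later in this deck):
  % mpicc -g matmultc.c -o matmultc -L${TACC_MPIP_LIB} -lmpiP -lbfd -liberty -lintl -lm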
Control
• External control: set the MPIP environment variable (threshold, callsite depth)
  – setenv MPIP '-t 10 -k 2'
  – export MPIP='-t 10 -k 2'
• Use MPI_Pcontrol(#) to limit profiling to specific code blocks (see the sketch after the table below):

  C                     F90
  MPI_Pcontrol(2);      call MPI_Pcontrol(2)
  MPI_Pcontrol(1);      call MPI_Pcontrol(1)
  MPI_Abc(…);           call MPI_Abc(…)
  MPI_Pcontrol(0);      call MPI_Pcontrol(0)
  Pcontrol Arg   Behavior
  0              Disable profiling.
  1              Enable profiling.
  2              Reset all callsite data.
  3              Generate verbose report.
  4              Generate concise report.
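A minimal C sketch of scoping mpiP measurement to one region (the exchange_step routine and its MPI_Allreduce are hypothetical stand-ins, not from the course example):

#include <mpi.h>

/* Measure only this phase: reset callsite data, enable profiling
   around the calls of interest, then disable it again. */
void exchange_step(double *buf, int n)
{
    MPI_Pcontrol(2);                      /* reset all callsite data */
    MPI_Pcontrol(1);                      /* enable profiling        */
    MPI_Allreduce(MPI_IN_PLACE, buf, n, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);        /* the call to be measured */
    MPI_Pcontrol(0);                      /* disable profiling       */
}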
Example
• Untar the mpiP tar file:
  % tar -xvf ~train00/mpip.tar; cd mpip
• Read the "Instructions" file; set up the environment with the sourceme files:
  % source sourceme.csh   (C-type shells)
  % source sourceme.sh    (Bash-type shells)

  module purge            {Resets to default modules.}
  module load TACC
  module unload mvapich   {Change compiler/MPI to intel/mvapich.}
  module swap pgi intel
  module load mvapich
  module load mpiP        {Load up mpiP.}
  set path=($path /share/home/00692/train00/mvapich1_intel/Mpipview_2.02/bin)
                          {Sets PATH with the mpipview dir (separate software from mpiP).}
Example (cont.)
• Compile either the matmultc.c or matmultf.f90 code:
  % make matmultf
  or
  % make matmultc
• Uncomment "ibrun matmultc" or "ibrun matmultf" in the job file.
• Submit the job:
  % qsub job
• Results appear in:
  <exec_name>.<unique number>.mpiP
Example *.mpiP Output
• MPI-Time: wall-clock time for all MPI calls
• MPI callsites

Output (cont.)
• Message sizes
• Aggregate time

mpipview
e.g. % mpipview matmultf.32.6675.1.mpiP
Note the list selection: Indexed, Call Sites, Statistics, Message Sizes.
Tau Outline
• General
  – Measurements
  – Instrumentation & Control
  – Example: matmult
• Profiling and Tracing
  – Event Tracing
  – Steps for Performance Evaluation
  – Tau Architecture
• A look at a task-parallel MxM implementation
• Paraprof Interface
General
• Tuning and Analysis Utilities (11+ year project effort)
  www.cs.uoregon.edu/research/paracomp/tau/
• Performance system framework for parallel, shared- and distributed-memory systems
• Targets a general complex system computation model
  – Nodes / Contexts / Threads
• Integrated toolkit for performance instrumentation, measurement, analysis, and visualization

TAU = Profiler and Tracer + Hardware Counters + GUI + Database
Tau: Measurements
• Parallel profiling
  – Function-level, block (loop)-level, statement-level
  – Supports user-defined events
  – TAU parallel profile data stored during execution
  – Hardware counter values
  – Support for multiple counters
  – Support for callgraph and callpath profiling
• Tracing
  – All profile-level events
  – Inter-process communication events
  – Trace merging and format conversion
Tau: Instrumentation
• PDT is used to instrument your code.
• Replace mpicc and mpif90 in makefiles with tau_cc.sh and tau_f90.sh.
• It is necessary to specify all the components that will be used in the
  instrumentation (MPI, OpenMP, profiling, counters [PAPI], etc.). However,
  these come in a limited number of combinations.
• Combinations: first determine what you want to do (profiling, PAPI counters,
  tracing, etc.), the programming paradigm (MPI, OpenMP), and the compiler.
  PDT is a required component:

  Instrumentation   Parallel Paradigm   Collectors   Compiler
  PDT               MPI                 PAPI         intel
  Hand-code         OMP                 Callpath     pgi
                    …                   …            gnu
Tau: Instrumentation (cont.)
You can view the available combinations:
  (alias tauTypes 'ls -C1 $TAU | grep Makefile')
Your selected combination is made known to the compiler wrapper through the
TAU_MAKEFILE environment variable. E.g., the PDT instrumentation (pdt) for the
Intel compiler (icpc) with MPI (mpi) is set with this command:
  setenv TAU_MAKEFILE /…/Makefile.tau-icpc-mpi-pdt
Other run-time and instrumentation options are set through TAU_OPTIONS.
For verbose output:
  setenv TAU_OPTIONS '-optVerbose'
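With the makefile selected, the wrappers are invoked like ordinary compilers; e.g., a sketch (file names assumed from this deck's example):
  % tau_cc.sh -o matmultc matmultc.c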
Tau Example
% tar -xvf ~train00/tau.tar
% cd tau
READ the Instructions file.
% source sourceme.csh  or  % source sourceme.sh   (creates env: modules and TAU_MAKEFILE)
% make matmultf  or  % make matmultc              (creates executables)
% qsub job                                        (submits job; edit and uncomment the ibrun line)
% paraprof                                        (GUI to analyze performance data)
Definitions – Profiling
• Profiling
  – Recording of summary information during execution
    • inclusive/exclusive time, # calls, hardware statistics, …
  – Reflects performance behavior of program entities
    • functions, loops, basic blocks
    • user-defined "semantic" entities
  – Very good for low-cost performance assessment
  – Helps to expose performance bottlenecks and hotspots
  – Implemented through
    • sampling: periodic OS interrupts or hardware counter traps
    • instrumentation: direct insertion of measurement code
Definitions – Tracing
• Tracing
  – Recording of information about significant points (events) during program execution
    • entering/exiting a code region (function, loop, block, …)
    • thread/process interactions (e.g., send/receive message)
  – Save information in event records
    • timestamp
    • CPU identifier, thread identifier
    • event type and event-specific information
  – An event trace is a time-sequenced stream of event records
  – Can be used to reconstruct dynamic program behavior
  – Typically requires code instrumentation
Event Tracing: Instrumentation, Monitor, Trace

Event definitions: 1 = master, 2 = worker, 3 = …

CPU A:                      CPU B:
void master {               void worker {
  trace(ENTER, 1);            trace(ENTER, 2);
  ...                         ...
  trace(SEND, B);             recv(A, tag, buf);
  send(B, tag, buf);          trace(RECV, A);
  ...                         ...
  trace(EXIT, 1);             trace(EXIT, 2);
}                           }

MONITOR (time-sequenced event records: timestamp, CPU, event, data):
  58  A  ENTER  1
  60  B  ENTER  2
  62  A  SEND   B
  64  A  EXIT   1
  68  B  RECV   A
  69  B  EXIT   2
  ...
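A toy C implementation of the trace() call above (the event names, log format, and clock choice are assumptions for illustration):

#include <stdio.h>
#include <time.h>

enum { ENTER, EXIT, SEND, RECV };
static const char *evname[] = { "ENTER", "EXIT", "SEND", "RECV" };

/* Append one event record (timestamp, CPU id, event type, data),
   mirroring the MONITOR table above. */
static void trace(FILE *log, char cpu, int event, const char *data)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    long stamp = ts.tv_sec * 1000000L + ts.tv_nsec / 1000;  /* microseconds */
    fprintf(log, "%ld %c %s %s\n", stamp, cpu, evname[event], data);
}

int main(void)
{
    FILE *log = fopen("trace.out", "w");
    trace(log, 'A', ENTER, "1");   /* enter region 1 (master)    */
    trace(log, 'A', SEND, "B");    /* record a send toward CPU B */
    trace(log, 'A', EXIT, "1");    /* exit region 1              */
    fclose(log);
    return 0;
}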
Event Tracing: "Timeline" Visualization

The same event records (58 A ENTER 1, 60 B ENTER 2, 62 A SEND B, 64 A EXIT 1,
68 B RECV A, 69 B EXIT 2, …) are rendered as a timeline: one row per CPU (A and B),
time on the horizontal axis (58–70), with the main/master/worker regions shaded
and the message from A to B drawn as an arrow between the rows.
Steps of Performance Evaluation
• Collect a basic routine-level timing profile to determine where most time is being spent.
• Collect routine-level hardware counter data to determine the types of performance problems.
• Collect callpath profiles to determine the sequence of events causing performance problems.
• Conduct finer-grained profiling and/or tracing to pinpoint performance bottlenecks:
  – loop-level profiling with hardware counters
  – tracing of communication operations
TAU Performance System Architecture
(architecture diagram)
Overview of Matmult: C = A x B  (order N, P tasks)

MASTER:                      WORKER:
  Create A & B                 Receive B
  Send B                       Receive a (a row of A)
  Send rows of A               Multiply: row j of C = a x B
  Receive rows of C            Send back row of C
Preparation of Matmult: C = A x B  (order N, P tasks)

MASTER (PE 0) generates A and B:
  PE 0: Create A
  PE 0: Create B

PE 0 -> PE x: Broadcast B to all, by columns:
  loop over i (i=1…n)
    MPI_Bcast( b(1,i)…
Master Ops of Matmult: C = A x B  (order N, P tasks)

Master (PE 0) sends rows 1 through (p-1) to the slaves (PE 1 … p-1), which receive them:
  loop over i (i=1…p-1)
    MPI_Send(arow … i,i        <- destination, tag

Master (PE 0) then receives rows 1 through n from the slaves (PE 1 … p-1),
sending a new row to each idle slave as its result comes back:
  loop over i (i=1…n)
    MPI_Recv(crow … ANY,k      <- source, tag
    MPI_Send(arow … idle,j     <- destination, tag
Worker Ops of Matmult: C = A x B  (order N, P tasks)

Worker:
  Pick up the broadcast of B columns from PE 0:
    loop over i (i=1…n)
  Receive any A row from PE 0:
    MPI_Recv( arow … ANY,j     <- source, tag
  Multiply all columns of B into the received row (matrix x vector)
  to form row j of matrix C.
  Send row j of C back to the master, PE 0:
    MPI_Send( crow … j         <- tag
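A compact C sketch of this master/worker scheme (the matrix order, initialization, and stop signal are assumptions for illustration; this is not the course's matmultc.c):

#include <mpi.h>
#include <stdlib.h>

#define N 256   /* matrix order, assumed for this sketch */

int main(int argc, char **argv)
{
    int rank, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);      /* assumes 2 <= np <= N+1 */

    double *B = malloc((size_t)N * N * sizeof(double));
    double arow[N], crow[N];
    MPI_Status st;

    if (rank == 0) {                         /* MASTER */
        double *A = malloc((size_t)N * N * sizeof(double));
        double *C = malloc((size_t)N * N * sizeof(double));
        for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
        MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        int sent = 0;                        /* seed each worker with one row */
        for (int w = 1; w < np && sent < N; w++, sent++)
            MPI_Send(&A[sent * N], N, MPI_DOUBLE, w, sent, MPI_COMM_WORLD);

        for (int recvd = 0; recvd < sent; recvd++) {
            MPI_Recv(crow, N, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);   /* tag = row index of C */
            for (int j = 0; j < N; j++) C[st.MPI_TAG * N + j] = crow[j];
            if (sent < N) {                  /* feed the now-idle worker */
                MPI_Send(&A[sent * N], N, MPI_DOUBLE, st.MPI_SOURCE, sent,
                         MPI_COMM_WORLD);
                sent++;
            } else {                         /* no rows left: stop signal */
                MPI_Send(arow, 0, MPI_DOUBLE, st.MPI_SOURCE, N, MPI_COMM_WORLD);
            }
        }
    } else {                                 /* WORKER */
        MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        for (;;) {
            MPI_Recv(arow, N, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG >= N) break;      /* tag N means stop */
            for (int j = 0; j < N; j++) {    /* crow = arow x B */
                crow[j] = 0.0;
                for (int k = 0; k < N; k++) crow[j] += arow[k] * B[k * N + j];
            }
            MPI_Send(crow, N, MPI_DOUBLE, 0, st.MPI_TAG, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}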
Tau: Analyzing Performance Data

• Execute the application and analyze performance data:
  % qsub job
  – Look for files: profile.<task_no>.
  – With multiple counters, look for directories for each counter.
• % pprof     (for text-based profile display)
• % paraprof  (for GUI)
  – pprof and paraprof will discover the files/directories.
  – paraprof runs on PCs; files/directories can be downloaded to a laptop and analyzed there.
Tau Paraprof Overview
(diagram: raw profile files, plus HPMToolkit, MpiP, and TAU data with metadata,
feed the PerfDMF-managed database; runs are organized as Application -> Experiment -> Trial)
Tau Paraprof Manager Window
Provides machine details and organizes runs as Applications, Experiments, and Trials.
Routine Time Experiment
Profile information is in the "GET_TIME_OF_DAY" metric.
Mean and standard deviation statistics are given.
Multiply_Matrices Routine Results
The Function Data window gives a closer look at a single function (not from the same run).
Floating-Point Ops Trial
Hardware counters provide floating-point operation counts (Function Data view).
L1 Data Cache Miss Trial
Hardware counters provide L1 data cache miss counts.
Call Path
Call graph paths (must be selected through the "thread" menu).
Call Path (cont.)
TAU_MAKEFILE = …Makefile.tau-callpath-icpc-mpi-pdt
Derived Metrics
Select Argument 1 (green ball); select Argument 2 (green ball); select the
Operation; then Apply. The derived metric will appear as a new trial.
Derived Metrics (cont.)
Since the FP/miss ratios are constant, it must be a memory-access problem.
Be careful: even though the ratios are constant, cores may do different
amounts of work/operations per call.
GPROF
GPROF is the GNU Project PROFiler.
gnu.org/software/binutils/
• Requires recompilation of the code.
• Compiler options and libraries provide wrappers for each routine call and
  periodic sampling of the program.
• A default gmon.out file is produced with the function-call information.
• GPROF links the symbol list in the executable with the data in gmon.out.
Types of Profiles
• Flat Profile
  – CPU time spent in each function (self and cumulative)
  – Number of times a function is called
  – Useful to identify the most expensive routines
• Call Graph
  – Number of times a function was called by other functions
  – Number of times a function called other functions
  – Useful to identify function relations
  – Suggests places where function calls could be eliminated
• Annotated Source
  – Indicates the number of times a line was executed
Profiling with gprof
Use the -pg flag during compilation:
% gcc -g -pg srcFile.c
% icc -g -pg srcFile.c
% pgcc -g -pg srcFile.c
Run the executable. An output file gmon.out will be generated with the
profiling information.
Execute gprof and redirect the output to a file:
% gprof exeFile gmon.out > profile.txt
% gprof -l exeFile gmon.out > profile_line.txt
% gprof -A exeFile gmon.out > profile_annotated.txt
gprof example
Move into the profiling examples directory:
% cd profile/
Compile matvecop with the profiling flag (source at ~train00/matvecop.c):
% gcc -g -pg matvecop.c -lm
Run the executable to generate the gmon.out file:
% ./a.out
Run the profiler and redirect output to a file:
% gprof a.out gmon.out > profile.txt
Open the profile file and study the instructions.
Visual Call Graph
(call-graph diagram: main calls matSqrt, matCube, sysCube, vecSqrt, vecCube,
and sysSqrt; sysSqrt in turn calls matSqrt and vecSqrt)
Flat profile
In the flat profile we can identify the most expensive parts of the code
(in this case, the calls to matSqrt, matCube, and sysCube).

  %    cumulative    self               self     total
 time    seconds    seconds   calls    s/call   s/call   name
50.00       2.47       2.47       2      1.24     1.24   matSqrt
24.70       3.69       1.22       1      1.22     1.22   matCube
24.70       4.91       1.22       1      1.22     1.22   sysCube
 0.61       4.94       0.03       1      0.03     4.94   main
 0.00       4.94       0.00       2      0.00     0.00   vecSqrt
 0.00       4.94       0.00       1      0.00     1.24   sysSqrt
 0.00       4.94       0.00       1      0.00     0.00   vecCube
Call Graph Profile

index  % time    self  children    called     name
                 0.00      0.00       1/1         <hicore> (8)
[1]     100.0    0.03      4.91       1        main [1]
                 0.00      1.24       1/1         sysSqrt [3]
                 1.24      0.00       1/2         matSqrt [2]
                 1.22      0.00       1/1         sysCube [5]
                 1.22      0.00       1/1         matCube [4]
                 0.00      0.00       1/2         vecSqrt [6]
                 0.00      0.00       1/1         vecCube [7]
-----------------------------------------------
                 1.24      0.00       1/2         main [1]
                 1.24      0.00       1/2         sysSqrt [3]
[2]      50.0    2.47      0.00       2        matSqrt [2]
-----------------------------------------------
                 0.00      1.24       1/1         main [1]
[3]      25.0    0.00      1.24       1        sysSqrt [3]
                 1.24      0.00       1/2         matSqrt [2]
                 0.00      0.00       1/2         vecSqrt [6]
-----------------------------------------------
Profiling dos and don'ts

DO
• Test every change you make
• Profile typical cases
• Compile with optimization flags
• Test for scalability

DO NOT
• Assume a change will be an improvement
• Profile atypical cases
• Profile ad infinitum
  – Set yourself a goal or a time limit
PAPI Implementation
(layer diagram, top to bottom:
  Tools
  Portable Layer: PAPI High Level, PAPI Low Level
  Machine-Specific Layer: PAPI Machine-Dependent Substrate, Kernel Extension,
  Operating System, Hardware Performance Counters)
PAPI: Performance Monitoring

• Provides high-level counters for events:
  – floating-point instructions/operations
  – total instructions and cycles
  – cache accesses and misses
  – Translation Lookaside Buffer (TLB) counts
  – branch instructions taken, predicted, mispredicted
  – wall and processor times
  – total floating-point operations and MFLOPS
• http://icl.cs.utk.edu/projects/papi
• Low-level functions are thread-safe; high-level functions are not.
• PAPI_flops routine for basic performance analysis
High-level API

  C interface               Fortran interface
  PAPI_start_counters       PAPIF_start_counters
  PAPI_read_counters        PAPIF_read_counters
  PAPI_stop_counters        PAPIF_stop_counters
  PAPI_accum_counters       PAPIF_accum_counters
  PAPI_num_counters         PAPIF_num_counters
  PAPI_flips                PAPIF_flips
  PAPI_ipc                  PAPIF_ipc

Instrumenting code:
1) Specify "events" in an array.
2) Initialize the library.
3) Start counting events.
4) Stop counting (and accumulate).
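A minimal C sketch of those four steps with the high-level counter interface (the chosen events and the toy loop are assumptions for illustration):

#include <stdio.h>
#include <papi.h>

int main(void)
{
    int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };   /* 1) events array   */
    long long values[2];

    if (PAPI_num_counters() < 2) {                    /* 2) init + check   */
        fprintf(stderr, "not enough hardware counters\n");
        return 1;
    }
    if (PAPI_start_counters(events, 2) != PAPI_OK)    /* 3) start counting */
        return 1;

    double s = 0.0;                                   /* work to measure   */
    for (int i = 1; i < 10000000; i++) s += 1.0 / i;

    if (PAPI_stop_counters(values, 2) != PAPI_OK)     /* 4) stop and read  */
        return 1;

    printf("sum=%g cycles=%lld instructions=%lld\n", s, values[0], values[1]);
    return 0;
}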
Download