mpiP Evaluation Report
Hans Sherburne,
Adam Leko
UPC Group
HCS Research Laboratory
University of Florida
Basic Information

Name: mpiP
Developer: Jeffrey Vetter (ORNL), Chris Chambreau (LLNL)
Current versions: mpiP v2.8
Website: http://www.llnl.gov/CASC/mpip/
Contacts:
  Jeffrey Vetter: vetterjs@ornl.gov
  Chris Chambreau: chcham@llnl.gov
mpiP: Lightweight, Scalable MPI Profiling

- mpiP is a simple, lightweight profiling tool
- Gathers information through the MPI profiling (PMPI) layer (see the sketch below)
  - Probably not a good candidate to be extended for UPC or SHMEM
- Supports many platforms running Linux, Tru64, AIX, UNICOS, and IBM BG/L
- Very simple to use, and the output file is easy to understand
  - Provides statistics for the top twenty MPI callsites, ranked by time spent in the call and by total size of messages sent; also provides statistics for MPI I/O
  - Callsite traceback depth is variable, allowing the user to differentiate between and examine the behavior of routines that are wrappers for MPI calls
- An mpiP viewer, Mpipview, is available as part of Tool Gear
- Some of mpiP's functionality is exposed to developers through an API: stackwalking, address-to-source translation, symbol demangling, timing routines, and accessing the name of the executable
  - These functions might be useful if source-code correlation is to be included in a UPC or SHMEM tool
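For readers unfamiliar with the PMPI mechanism mentioned above, the following is a minimal sketch of how a tool can interpose on MPI calls through the profiling layer. This is the standard interposition idiom, not mpiP's actual source; the timing and reporting logic here is purely illustrative.

/* Minimal sketch of MPI profiling-layer (PMPI) interposition, the standard
 * mechanism mpiP relies on.  Linking this wrapper ahead of the MPI library
 * makes MPI_Send resolve here; the wrapper records timing and then forwards
 * to the real routine via PMPI_Send. */
#include <mpi.h>
#include <stdio.h>

static double total_send_time = 0.0;  /* accumulated time inside MPI_Send */
static long   send_count      = 0;    /* number of intercepted calls      */

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double start = MPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm); /* real call */
    total_send_time += MPI_Wtime() - start;
    send_count++;
    return rc;
}

int MPI_Finalize(void)
{
    /* Dump this process's totals before the real finalize runs. */
    fprintf(stderr, "MPI_Send: %ld calls, %.6f s total\n",
            send_count, total_send_time);
    return PMPI_Finalize();
}

Because the interception happens entirely at link time, the application itself needs no source changes, which is why mpiP is considered lightweight.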
What is mpiP Useful For?

- The data collected by mpiP is useful for analyzing the scalability of parallel applications.
- By examining the aggregate time and the rank correlation of the time spent in each MPI call versus the total time spent in MPI calls as the number of tasks increases, one can locate flaws in load balancing and algorithm design.
- This technique is described in [1]: Vetter, J. & McCracken, M., "Statistical Scalability Analysis of Communication Operations in Distributed Applications."
- (The figures illustrating this technique, courtesy of [1], are omitted here.)
The Downside…

- mpiP does provide the measurements of aggregate callsite time and total MPI call time necessary for computing the rank correlation coefficient
- mpiP does NOT automate the process of computing the rank correlation, which must use data from multiple experiments
- Equations for calculating the coefficients of correlation (linear and rank), per [1], are given below
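The equations appeared as an image on the original slide and are not preserved in this text; the standard textbook forms of the linear (Pearson) and rank (Spearman) correlation coefficients, which is what the slide referred to, are (in LaTeX notation):

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n\,(n^2 - 1)}

Here x_i and y_i are the two quantities being compared across the n measurements (for example, per-callsite MPI time and total MPI time across runs at different task counts), and d_i is the difference between the ranks of x_i and y_i (the simple form above assumes no ties).

Since mpiP does not automate this step, the user must combine the aggregate times from several runs by hand or with a small script. The following is a hypothetical helper, not part of mpiP, showing one way to compute the rank correlation from such data; the input numbers are illustrative only.

/* Hypothetical helper for the manual step mpiP does not automate: given the
 * aggregate time for one callsite (x) and the total MPI time (y) from n
 * separate experiments, compute the Spearman rank correlation.  Assumes no
 * tied values, matching the simple formula above. */
#include <stdio.h>

static void ranks(const double *v, int n, double *r)
{
    /* rank = 1 + number of elements smaller than v[i] (no ties assumed) */
    for (int i = 0; i < n; i++) {
        int smaller = 0;
        for (int j = 0; j < n; j++)
            if (v[j] < v[i])
                smaller++;
        r[i] = smaller + 1;
    }
}

static double spearman(const double *x, const double *y, int n)
{
    double rx[64], ry[64], sum_d2 = 0.0;   /* supports up to 64 experiments */
    ranks(x, n, rx);
    ranks(y, n, ry);
    for (int i = 0; i < n; i++) {
        double d = rx[i] - ry[i];
        sum_d2 += d * d;
    }
    return 1.0 - 6.0 * sum_d2 / (n * ((double)n * n - 1.0));
}

int main(void)
{
    /* Illustrative numbers only: one callsite's time and the total MPI time
     * from four runs at increasing task counts. */
    double callsite_time[]  = {  1.2,  2.9,  6.1, 13.8 };
    double total_mpi_time[] = { 10.0, 19.5, 41.0, 88.0 };
    printf("rank correlation = %.3f\n",
           spearman(callsite_time, total_mpi_time, 4));
    return 0;
}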
Partial Sample of mpiP Output
(screenshot of an mpiP output file omitted)
Information Provided by mpiP

- Information is displayed in terms of task assignments and callsites, which correspond to machines and to MPI calls in the source code, arranged in the following sections:
  - Time per task (AppTime, MPITime, MPI%)
  - Location of each callsite in source code (callsite, line#, parent function, MPI call)
  - Aggregate time per callsite, top twenty (time, app%, MPI%, variance)
  - Aggregate sent message size per callsite, top twenty (count, total, avg., MPI%)
  - Time statistics per callsite per task, all (max, min, mean, app%, MPI%)
  - Sent message size statistics per callsite per task, all (count, max, min, mean, sum)
  - I/O statistics per callsite per task, all (count, max, min, mean, sum)
mpiP Overhead
(Bar chart: "mpiP Profiling Overhead", plotting overhead (instrumented/uninstrumented) against benchmark for CAMEL, NAS LU (8p, W), NAS LU (32p, B), and the PP programs Big message, Diffuse procedure, Hot procedure, Intensive server, Ping pong, Random barrier, Small messages, System time, and Wrong way. All measured overheads were below about 7%.)
Source Code Correlation in Mpipview
(screenshot of Mpipview's source code correlation view omitted)
Bottleneck Identification Test Suite

- Testing metric: what did the profile data tell us?
- CAMEL: TOSS-UP
  - Profile showed that MPI time is a small percentage of overall application time
  - Profile reveals some imbalance in the amount of time spent in certain calls, but doesn't help the user understand the cause
  - Profile does not provide information about what occurs when execution is not in MPI calls
  - Difficult to grasp overall program behavior from profiling information alone
- NAS LU: TOSS-UP
  - Profile reveals that MPI function calls consume a significant portion of application time
  - Profile reveals some imbalance in the amount of time spent in certain calls, but doesn't help the user understand the cause
  - Profile does not provide information about what occurs when execution is not in MPI calls
  - Difficult to grasp overall program behavior from profiling information alone
Bottleneck Identification Test Suite (2)

- Big message: PASSED
  - Profile clearly shows that Send and Recv dominate the application time
  - Profile shows a large number of bytes transferred
- Ping pong: PASSED
  - Profile showed that time spent in MPI function calls dominated the total application time
  - Profile showed an excessive number of Sends and Recvs, with little load imbalance
- System time: FAIL
  - Profile shows that the majority of execution time is spent in Barrier called by the processes not holding the "potato"
- Small messages: PASS
  - Profile clearly shows that a single process spends almost all of the total application time in Recv, and receives an excessive number of messages sent by all the other processes
- Intensive server: PASSED
  - Profile showed one process spent very little time in MPI calls, while the remaining processes spent nearly all their time in Recvs
  - Profile showed one process sent an order of magnitude more data than the others, and spent far more time in Send
- Random barrier: PASS
  - Profile showed a large amount of time spent in barrier
  - Time is diffused across processes
  - Profile does not show that in each barrier a single process is always delaying completion
- Hot procedure: FAIL
  - No profile output, due to no MPI calls (other than setup and breakdown)
- Diffuse procedure: FAIL
  - No profile output, due to no MPI calls (other than setup and breakdown)
- Wrong way: TOSS-UP
  - One process spends most of the execution time in sends; the other spends most of the execution time in receives
  - Profile does not reveal the improperly ordered communication pattern
Evaluation (1)

- Available metrics: 1/5
  - Only provides a handful of statistics about time, message size, and frequency of MPI calls
  - No hardware counter support
- Cost: free 5/5
- Documentation quality: 4/5
  - Though brief (a single webpage), the documentation adequately covers installation and available functionality
- Hardware support: 5/5
  - 64-bit Linux (Itanium and Opteron), IBM SP (AIX), AlphaServer (Tru64), Cray X1, Cray XD1, SGI Altix, IBM BlueGene/L
- Filtering and aggregation: 2/5
  - mpiP was designed to be lightweight, and presents statistics for the top twenty callsites
  - Output size grows with the number of tasks (machines)
- Extensibility: 2/5
  - mpiP is designed around the MPI profiling layer, so it would not be readily adapted to UPC or SHMEM and would be of little use there
  - The source code correlation functions work well
- Heterogeneity support: 0/5 (not supported)
Evaluation (2)

- Installation: 5/5
  - About as easy as you could expect
- Interoperability: 1/5
  - mpiP has its own output format
- Manual overhead: 1/5
  - All MPI calls are automatically instrumented when linking against the mpiP library
  - No way to turn tracing on/off in places without relinking
- Learning curve: 4/5
  - Easy to use
  - Simple statistics are easily understood
- Measurement accuracy: 4/5
  - Correctness of programs is not affected
  - Overhead is low (less than 7% for all test suite programs)
  - CAMEL overhead: ~5%
Evaluation (3)

- Multiple executions: 0/5 (not supported)
- Multiple analyses & views: 2/5
  - Statistics regarding MPI calls are displayed in the output file
  - Source code location to callsite correlation provided by Mpipview
- Performance bottleneck identification: 2.5/5
  - No automatic methods supported
  - Some bottlenecks could be deduced by examining the gathered statistics
  - Lack of trace information makes some bottlenecks impossible to detect
- Profiling/tracing support: 2/5
  - Only supports profiling
  - Profiling can be enabled for various regions of code by editing the source code (see the sketch below)
  - Turning profiling on/off requires recompilation
  - (A runtime environment variable for deactivating profiling is given in the documentation, and is acknowledged in the profile output file when set, but profiling is not actually disabled)
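The region-scoped profiling mentioned above uses the standard MPI_Pcontrol hook. The sketch below illustrates the usual pattern; the exact levels honored (and the runtime option to start with profiling disabled so that only the MPI_Pcontrol(1) region is measured) should be checked against the documentation of the installed mpiP version. The solver_iteration routine is a hypothetical stand-in for application work.

/* Sketch of region-scoped profiling via the standard MPI_Pcontrol hook.
 * mpiP's documentation describes interpreting MPI_Pcontrol(0) as "disable
 * profiling" and MPI_Pcontrol(1) as "enable profiling"; changing the scoped
 * region requires editing and recompiling the application. */
#include <mpi.h>

static void solver_iteration(void)
{
    /* placeholder for the real computation and MPI communication */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Pcontrol(0);            /* skip profiling of setup communication */
    /* ... setup / data distribution ... */

    MPI_Pcontrol(1);            /* profile only the main computation loop */
    for (int i = 0; i < 100; i++)
        solver_iteration();
    MPI_Pcontrol(0);

    MPI_Finalize();
    return 0;
}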
Evaluation (4)

- Response time: 3/5
  - No results until after the run
  - Quickly assembles the report at the end of the experiment run
- Searching: 0/5 (not supported)
- Software support: 3/5
  - Supports C, C++, and Fortran
  - Supports a large number of compilers
  - Tied closely to MPI applications
- System stability: 5/5
  - mpiP and Mpipview work very reliably
- Source code correlation: 4/5
  - Source code line numbers are provided for each MPI callsite in the output file
  - Automatic source code correlation provided by Mpipview
- Technical support: 5/5
  - Co-author Chris Chambreau responded quickly and provided good information, allowing us to correct a problem with one of our benchmark apps