MPE/Jumpshot Evaluation Report Adam Leko Hans Sherburne, UPC Group

advertisement
MPE/Jumpshot Evaluation Report
Color encoding key:
Adam Leko
Hans Sherburne,
UPC Group
Blue: Information
Red: Negative note
Green: Positive note
HCS Research Laboratory
University of Florida
Basic Information



Name: MPE/Jumpshot
Developer: Argonne National Labratory
Current versions:




MPE 1.26
Jumpshot-4
Website:
http://www-unix.mcs.anl.gov/perfvis/
Contacts:




Anthony Chan (chan@mcs.anl.gov)
David Ashton (ashton@mcs.anl.gov)
Rusty Lusk (lusk@mcs.anl.gov)
William Gropp (gropp@mcs.anl.gov)
2
What Is MPE/Jumpshot?


The “quintessential” MPI logging and post-mortem
visualization toolset
MPE – Multi-Processing Environment


A software package for MPI programmers
Has three main parts:
 A tracing library that outputs all MPI calls to stdout
 A shared-display parallel X graphics and animation library
 A logging library for logging events


Note: MPE/Jumpshot “logging” -> what we call tracing
Jumpshot




A visualization tool for logfiles created by the MPE package
Written in Java (crossplatform)
Provides a “time line” (GANTT) view of MPI and program
events
Also has basic search and summary (histogram) functionality
3
Logfiles: What’s In A Format?

Much thought has been put into logfile
formats



“Traditional” tracing results in large trace files
Trace file format can play a large part in
visualization tool’s response time
ALOG: original format (Argonne LOGging
format)


Text-based format
Visualization tool: Upshot




Turned out to be too slow
Parts rewritten in C (“Nupshot”) but Tcl->C
interface kept changing
BLOG: intermediary format
CLOG


Binary file format created to improve upon
ALOG and BLOG
Visualization tools



Low overhead
Can be easily converted to other formats
as needed
SLOG: “scalable” format


State-based logging format
Visualization tool: Jumpshot-3



Rewrite of Jumpshot-2 to use SLOG
Can scale to ~GB logfiles
SLOG-2: Current logfile format



Next-generation SLOG file format
“Graphical” logfile format to speed
logfile parsing
Visualization tool: Jumpshot-4
Jumpshot-1




By default, MPE still outputs logfiles in
CLOG

An X-windows application using the Athena
widget toolset
Later rewritten using Tcl/Tk for easy coding



Complete rewrite of Upshot/Nupshot
Coded in Java/AWT for cross-platformness
Bad performance, not widely used
Jumpshot-2


Improved version using Java/SWING
Slightly better performance
4
MPE Overview

Tracing capability

Automatic instrumentation: mpicc –mpitrace

Writes to stdout at every MPI call, eg





[1] Starting MPI_Send with count = 28, dest = 0, tag = 0...
[1] Ending MPI_Send
Equivalent “manual” method: printf
Very simple & intuitive
Parallel graphics ability

Automatic instrumentation: mpicc -mpianim -L/usr/X11R6/lib lX11 –lm




Displays graphics on one machine
Circle for each process, arrow indicate sends/receives
Slows down execution considerably
Graphics are also available via library calls


Calls seem relatively easy to use: MPE_Draw_string, MPE_Draw_circle,
MPE_Update, etc
Probably not all that useful
5
MPE Overview (2)

Logging ability

Automatic instrumentation: mpicc -mpilog

Logs start and stop of events
Can overlap starting and stopping of events
Can add “custom” events



Easy to do using library calls





MPE_Log_get_event_number: create a new event
MPE_Describe_state: gives name and color to event
MPE_Log_event: records event in logfile, uses MPI_Wtime to get global
time
Custom events show up in Jumpshot-4 just like events from
automatic instrumentation
Conventions


Automatic instrumentation uses all caps (SEND, RECV)
Manual instrumentation uses mixed case
6
MPE Overhead


All programs executed correctly when instrumented 
Expect about 5% overhead of “real-world” applications
Barrier recording mechanism has a lot of overhead
Most applications don’t use a bunch of barriers, though

MPE logging overhead
PP: Wrong way
0%
PP: System time
2%
1%
PP: Small messages
PP: Random barrier
Benchmark

56%
PP: Ping pong
1%
1%
PP: Intensive server
PP: Hot procedure
0%
PP: Diffuse procedure
PP: Big message
48%
0%
NAS LU (32p, B)
5%
NAS LU (8p, W)
CAMEL
4%
0%
0%
10%
20%
30%
40%
50%
60%
Overhead (instrumented/uninstrumented)
7
MPE Overhead: Barriers


Programs that have large measurement overhead shown
below
Tons of barriers! (yellow)
PPerfMark: diffuse procedure
PPerfMark: random barrier
8
Jumpshot Overview

Jumpshot-4 supports two types of
visualizations for metrics



Timeline (right, top)
Histogram (right, bottom)
Visualization is dependant on SLOG2 format and Data model

Real drawables




State – Single timeline ID, start/end
timestamp
Arrow – Pair of timeline IDs, start/end
timestamp
Event – Single timeline ID, single
timestamp
Timeline view
Preview drawables



Amalgamation of real drawables
One corresponding type for each of
the real drawables
Serve to optimize performance of
visualization
Histogram view
9
Jumpshot Overview (2)

Emphasis on providing useful profile analysis
from



Nice features





High-level (entire program execution) view
Low-level (individual events) view
Intuitive interface
Automatically converts from CLOG to SLOG-2
Very good support for zooming and scrolling
User manual very thorough
Timeline view
Things that could use improvement

Java application -> uses a lot of memory (~70100MB during typical runs)



Memory uses seems to scale nicely with logfile
size though
No direct support for non-event-based data
(running averages, time-varying histograms for
cache miss numbers, etc)
Documentation a little unclear/excessively technical
in some places
Histogram view
10
Bottleneck Identification Test Suite


Testing metric: what did trace visualization tell us (automatic
instrumentation)?
CAMEL: PASSED




Identified large number of small messages at beginning of program
execution
Also identified sequential parts of algorithm (sort on node 0, etc)
No other problems visible from trace
NAS LU (“W” workload): PASSED

Showed communication bottlenecks very clearly



Large(!) number of small messages
Illustrated time taken for repartitioning data
Shows sensitivity to latency for processors waiting on data from other
processors
11
Bottleneck Identification Test Suite (2)

Big message: PASSED







Small messages: PASSED


CLOG trace file conversion failed (no
communication events)
Even if trace loaded, no communication
problems
Traces illustrated a large number of messages
being sent to node 0
System time: FAILED



Traces showed that one noe was doing more
work than the others
CLOG trace file conversion failed (no
communication events)
Even if trace loaded, no communication problems
Wrong way: PASSED

First receive took a long time for message to
arrive in trace
Intensive server: PASSED


Traces illustrated a lot of synchronization with
one process doing more work
Since no source code correlation, hard to tell
why problem existed
Random barrier: PASSED
Hot procedure: FAILED


Traces illustrated large amount of time spent
in send and receive
Diffuse procedure: PASSED


Traces showed that other nodes were waiting
on node 0
Ping pong: PASSED


Traces illustrated that the application was
very latency-sensitive
Much time being spent on waiting for
messages to arrive
12
NAS LU (Class W) Visualization
Much time
taken for data
redistribution
Large number
of small
messages
13
General Comments

Good things






Jumpshot-4 represents a well-written, scalable event-based tracefile
viewer
Formats used by Jumpshot are well-defined
Low measurement overhead in MPICH
Mature GUI, few bugs, has been around for a long time in one form or
another
To leverage, just need to write logfile in a specific format
Things that could use improvement

Adding support for metrics other than events would require hacking
SLOG-2 format




E.g., how to support showing L-2 miss rates as time increases?
Seems like it would be best used as part of our toolkit
Automatic instrumentation really necessary to make tool useful
Jumpshot-4 can fit in our toolkit as an event-based tracefile viewer if we
can easily write to a format it understands
14
Adding UPC/SHMEM Support

At a minimum, need mechanism to output CLOG trace files

CLOG library currently uses many MPI calls



E.g., MPI_Wtime for timing information
Therefore, cannot just insert MPE logging calls and use the MPE library unmodified 
However, CLOG format is defined


Could (relatively) easily create a C implementation that used UPC calls instead of MPI calls
Would need to come up with our own buffering scheme though




Could also use slog2sdk SDK kit for writing to SLOG-2 files directly, but



Can’t write files as data comes in, too slow
Should be able to steal a lot of code from MPE source
Not necessarily a problem, since we will most likely have to come up with a method if we go the tracing route anyways
API in Java only
SLOG-2 may have larger creation overhead than simple event-based formats such as
CLOG
Several examples (and example C code) given for converting logfiles of arbitrary
format to SLOG-2 format using slog2sdk


Can use our own log file format if needed!
Recommend going with CLOG though, so we can steal existing code
15
Evaluation (1)

Available metrics: 1/5




Cost: free 5/5
Documentation quality: 3.5/5







Can restrict event types being displayed from trace
Preview drawables and histograms provide aggregation abilities
Does not filter or aggregate data directly when recording data
Hardware support: 4/5



Jumpshot-4 written in Java (easy to find Java coders at UF)
Can easily add new events using MPE library calls
Adding time-varying metrics (histograms, etc) would require writing code from scratch
Filtering and aggregation: 3/5


Jumpshot-4 has a very good but lengthy user’s manual
slog2sdk (SDK for reading/writing SLOG-2 files) is not very clear, although SLOG-2 is also described in a
lengthy paper
Extensibility: 3.5/5


Only communication-based metrics (timeline + histograms) available
Restricted to recording event-based metrics
64-bit Linux (Opteron, Itanium), Tru64 (AlphaServer), IRIX, IBM SP (AIX), Cray MPI
Can be used with any MPICH or LAM installation many more
Heterogeneity support: 0/5 (not supported)
16
Evaluation (2)

Installation: 5/5



Interoperability: 0.5/5




Easy to learn, well-written documentation
MPE really easy to use (mpicc -mpilog)
Manual overhead: 1/5




No way provided to export SLOG-2 files to other viewers
Example code provided in slog2sdk on how to convert existing formats into SLOG-2 format
Learning curve: 4.5/5


About as easy as you could expect
Zero effort if using MPICH already, compiling from source also easy
All MPI calls automatically instrumented for you when linking against MPE
Adding other events requires manual work (not much though)
No way to turn on/off tracing in places without recompilation
Measurement accuracy: 5/5




CAMEL overhead < 1%
Correctness of programs not affected
Measurements seem accurate to millisecond (relies on MPI_Wtime resolution though)
Only large numbers of messages (106 or more back-to-back) or frequent barriers seem to introduce any
appreciable overhead
17
Evaluation (3)


Multiple executions: 0/5 (not supported)
Multiple analyses & views: 2/5



Performance bottleneck identification: 4.5/5




Only shows timeline and histograms (but does both very well)
Excellent zooming and scrolling features (scalable to GB logfiles)
No automatic methods supported
Traces do very good job of showing communication and synchronization
bottlenecks
Can also use custom events to indirectly determine some types of bottlenecks
(e.g., load imbalance)
Profiling/tracing support: 3/5

Only supports tracing
Trace format compact & scalable so viewer can comfortably show GB logfiles
Automatic tracing is either entirely on or entirely off

Turning on/off manual tracing requires code modification and recompilation


18
Evaluation (4)

Response time: 2/5


No results until after run
For an 850MB CLOG tracefile:



However, large trace files will be slower than a method that incorporates more filtering and
aggregation




Supports C & Fortran
Tied closely to MPI applications
Supports linking with any library supported by GCC/platform C compiler, but linked libraries will not
be profiled unless they contain MPI calls
Source code correlation: 1/5



Limitation of tracing method, not tool implementation
Software support: 3/5


Converting to SLOG-2 took 5 minutes
Opening up 350MB SLOG-2 file took about 10 seconds
Not directly supported
Can correlate indirectly by using custom events at function entry/exit points
Searching: 1.5/5

Only a simple search function available
19
Evaluation (5)

System stability: 4.5/5




MPE very stable (no problems observed)
Jumpshot-4 has very few bugs (small ones exist but do not get in the way)
Extremely good for a freely-downloadable research project
Technical support: 4/5



Jumpshot-4 does give very good error messages
Developers responded within 24 hours
Developers willing to help point us in the right direction for writing SLOG-2
files using their APIs
20
Download