Tool Visualizations, Metrics, and
Profiled Entities Overview
Adam Leko
HCS Research Laboratory
University of Florida

Summary

- Give characteristics of existing tools to aid our design discussions
  - Metrics (what is recorded, any hardware counters, etc.)
  - Profiled entities
  - Visualizations
- Most information & some slides taken from tool evaluations
- Tools overviewed:
  - TAU
  - Paradyn
  - MPE/Jumpshot
  - Dimemas/Paraver/MPITrace
  - mpiP
  - Dynaprof
  - KOJAK
  - Intel Cluster Tools (old Vampir/VampirTrace)
  - Pablo
  - MPICL/ParaGraph

TAU

- Metrics recorded
  - Two modes: profile, trace
  - Profile mode
    - Inclusive/exclusive time spent in functions (see the timing sketch after this slide)
    - Hardware counter information
      - PAPI/PCL: L1/2/3 cache reads/writes/misses, TLB misses, cycles, integer/floating point/load/store instructions executed, stalls, wall clock time, virtual time
      - Other OS timers (gettimeofday, getrusage)
    - MPI message size sent
  - Trace mode
    - Same as profile (minus hardware counters?)
    - Message send time, message receive time, message size, message sender/recipient(?)
- Profiled entities
  - Functions (automatic & dynamic), loops + regions (manual instrumentation)

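Since inclusive/exclusive time comes up for nearly every tool in this deck, here is a minimal sketch of the distinction using gettimeofday, one of the OS timers listed above. The helper and globals are hypothetical illustrations, not TAU's API:

```c
#include <stdio.h>
#include <sys/time.h>

/* Hypothetical manual-instrumentation helpers showing how a profiler
 * separates inclusive time (self + callees) from exclusive time (self only). */

static double now_sec(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static double f2_inclusive = 0.0;   /* total inclusive time spent in f2() */

static void f2(void) {
    double t0 = now_sec();
    /* ... work ... */
    f2_inclusive += now_sec() - t0;
}

static void f1(void) {
    double t0 = now_sec();
    /* ... work ... */
    double child0 = f2_inclusive;
    f2();                            /* callee time accumulates separately  */
    double inclusive = now_sec() - t0;
    double exclusive = inclusive - (f2_inclusive - child0);
    printf("f1: inclusive %.6f s, exclusive %.6f s\n", inclusive, exclusive);
}

int main(void) { f1(); return 0; }
```
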
TAU

- Visualizations
  - Profile mode
    - Text-based: pprof (example next slide), shows a summary of profile information
    - Graphical: racy (old), jracy a.k.a. paraprof
  - Trace mode
    - No built-in visualizations
    - Can export to CUBE (see KOJAK), Jumpshot (see MPE), and Vampir format (see Intel Cluster Tools)

TAU – pprof output
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
             msec   total msec                            usec/call
---------------------------------------------------------------------------------------
100.0        0.207       20,011           1           2   20011689 main() (calls f1, f5)
 75.0        1,001       15,009           1           2   15009904 f1() (sleeps 1 sec, calls f2, f4)
 75.0        1,001       15,009           1           2   15009904 main() (calls f1, f5) => f1() (sleeps 1 sec, calls f2, f4)
 50.0        4,003       10,007           2           2    5003524 f2() (sleeps 2 sec, calls f3)
 45.0        4,001        9,005           1           1    9005230 f1() (sleeps 1 sec, calls f2, f4) => f4() (sleeps 4 sec, calls f2)
 45.0        4,001        9,005           1           1    9005230 f4() (sleeps 4 sec, calls f2)
 30.0        6,003        6,003           2           0    3001710 f2() (sleeps 2 sec, calls f3) => f3() (sleeps 3 sec)
 30.0        6,003        6,003           2           0    3001710 f3() (sleeps 3 sec)
 25.0        2,001        5,003           1           1    5003546 f4() (sleeps 4 sec, calls f2) => f2() (sleeps 2 sec, calls f3)
 25.0        2,001        5,003           1           1    5003502 f1() (sleeps 1 sec, calls f2, f4) => f2() (sleeps 2 sec, calls f3)
 25.0        5,001        5,001           1           0    5001578 f5() (sleeps 5 sec)
 25.0        5,001        5,001           1           0    5001578 main() (calls f1, f5) => f5() (sleeps 5 sec)

TAU – paraprof

Paradyn

- Metrics recorded
  - Number of CPUs, number of active threads, CPU and inclusive CPU time
  - Function calls to and by
  - Synchronization (# operations, wait time, inclusive wait time)
  - Overall communication (# messages, bytes sent and received), collective communication (# messages, bytes sent and received), point-to-point communication (# messages, bytes sent and received)
  - I/O (# operations, wait time, inclusive wait time, total bytes)
  - All metrics recorded as “time histograms” (fixed-size data structure; see the sketch after this slide)
- Profiled entities
  - Functions only (but includes functions linked to in existing libraries)

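The fixed-size property is what keeps long runs cheap: when samples run past the histogram's current span, the bin width doubles and adjacent bins are folded together. A minimal sketch of that general scheme (an assumption for illustration; Paradyn's actual bookkeeping differs in detail):

```c
#include <stdio.h>
#include <string.h>

#define NBINS 8                    /* fixed size, regardless of run length */

typedef struct {
    double bin[NBINS];             /* metric value accumulated per time bin */
    double width;                  /* current bin width in seconds          */
} time_hist;

/* Fold adjacent bin pairs so the same NBINS cover twice the time span. */
static void fold(time_hist *h) {
    for (int i = 0; i < NBINS / 2; i++)
        h->bin[i] = h->bin[2 * i] + h->bin[2 * i + 1];
    memset(&h->bin[NBINS / 2], 0, (NBINS / 2) * sizeof(double));
    h->width *= 2.0;
}

/* Accumulate a sample at time t; double the span until t fits. */
static void add_sample(time_hist *h, double t, double value) {
    while (t >= h->width * NBINS)
        fold(h);
    h->bin[(int)(t / h->width)] += value;
}

int main(void) {
    time_hist h = { {0}, 1.0 };            /* start with 1-second bins */
    for (int t = 0; t < 100; t++)
        add_sample(&h, t, 1.0);            /* e.g., CPU time per second */
    printf("bin width after 100 s: %g s\n", h.width);   /* prints 16 s */
    return 0;
}
```

Any of the metrics listed above (CPU time, wait time, message bytes) can be accumulated this way, which is why the structure never grows with run length.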
Paradyn

- Visualizations
  - Time histograms
  - Tables
  - Barcharts
  - “Terrains” (3-D histograms)

Paradyn – time histograms

Paradyn – terrain plot (histogram across multiple hosts)

Paradyn – table (current metric values)

Paradyn – bar chart (current or average metric values)

MPE/Jumpshot

- Metrics collected
  - MPI message send time, receive time, size, message sender/recipient
  - User-defined event entry & exit
- Profiled entities
  - All MPI functions
  - Functions or regions via manual instrumentation and custom events (see the logging sketch after this slide)
- Visualization
  - Jumpshot: timeline view (space-time diagram overlaid on Gantt chart), histogram

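Those custom events come from MPE's logging API. A minimal sketch that brackets a region with a user-defined state so it shows up as a colored bar in Jumpshot (error checking omitted; the state name, color, and logfile prefix are made-up values):

```c
#include <mpi.h>
#include <mpe.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPE_Init_log();

    /* Allocate two event IDs and tie them together as a named state. */
    int ev_start = MPE_Log_get_event_number();
    int ev_end   = MPE_Log_get_event_number();
    MPE_Describe_state(ev_start, ev_end, "compute", "red");

    MPE_Log_event(ev_start, 0, "begin compute");
    /* ... region of interest ... */
    MPE_Log_event(ev_end, 0, "end compute");

    MPE_Finish_log("run");     /* writes the logfile Jumpshot reads */
    MPI_Finalize();
    return 0;
}
```
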
Jumpshot – timeline view

Jumpshot – histogram view

Dimemas/Paraver/MPITrace

- Metrics recorded (MPITrace)
  - All MPI functions
  - Hardware counters (2 from the following two lists; uses PAPI, see the sketch after this slide)
    - Counter 1
      - Cycles
      - Issued instructions, loads, stores, store conditionals
      - Failed store conditionals
      - Decoded branches
      - Quadwords written back from scache(?)
      - Correctible scache data array errors(?)
      - Primary/secondary I-cache misses
      - Instructions mispredicted from scache way prediction table(?)
      - External interventions (cache coherency?)
      - External invalidations (cache coherency?)
      - Graduated instructions
    - Counter 2
      - Cycles
      - Graduated instructions, loads, stores, store conditionals, floating point instructions
      - TLB misses
      - Mispredicted branches
      - Primary/secondary data cache miss rates
      - Data mispredictions from scache way prediction table(?)
      - External intervention/invalidation (cache coherency?)
      - Store/prefetch exclusive to clean/shared block

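Reading a counter pair like these through PAPI looks roughly as follows. This is a sketch using PAPI's low-level event-set API; PAPI_TOT_CYC and PAPI_TLB_DM stand in for whichever two counters the tool actually selects:

```c
#include <stdio.h>
#include <papi.h>

int main(void) {
    int evset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_CYC);   /* cycles          */
    PAPI_add_event(evset, PAPI_TLB_DM);    /* data TLB misses */

    PAPI_start(evset);
    /* ... instrumented region ... */
    PAPI_stop(evset, counts);

    printf("cycles=%lld tlb_misses=%lld\n", counts[0], counts[1]);
    return 0;
}
```
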
Dimemas/Paraver/MPITrace

- Profiled entities (MPITrace)
  - All MPI functions (message start time, message end time, message size, message recipient/sender)
  - User regions/functions via manual instrumentation
- Visualization
  - Timeline display (like Jumpshot)
    - Shows Gantt chart and messages
    - Also can overlay hardware counter information
    - Clicking on timeline brings up a text listing of events near where you clicked
  - 1D/2D analysis modules

Paraver – timeline (standard)

Paraver – timeline (HW counter)

Paraver – text module

Paraver – 1D analysis

Paraver – 2D analysis

mpiP

- Metrics collected
  - Start time, end time, message size for each MPI call
- Profiled entities
  - MPI function calls + PMPI wrapper
- Visualization
  - Text-based output, with graphical browser that displays statistics in-line with source
  - Displayed information:
    - Overall time (%) for each MPI node
    - Top 20 callsites for time (MPI%, App%, variance)
    - Top 20 callsites for message size (MPI%, App%, variance)
    - Min/max/average/MPI%/App% time spent at each call site
    - Min/max/average/sum of message sizes at each call site
  - Definitions (worked example after this slide):
    - App time = wall clock time between MPI_Init and MPI_Finalize
    - MPI time = all time consumed by MPI functions
    - App% = % of metric in relation to overall app time
    - MPI% = % of metric in relation to overall MPI time

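A quick worked example of those two percentages (numbers invented for illustration): if an application spends 100 s between MPI_Init and MPI_Finalize, of which 20 s is inside MPI functions, then a call site consuming 5 s has App% = 5/100 = 5% but MPI% = 5/20 = 25%.
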
mpiP – graphical view

Dynaprof

- Metrics collected
  - Wall clock time or PAPI metric for each profiled entity
  - Collects inclusive, exclusive, and 1-level call tree % information
- Profiled entities
  - Functions (dynamic instrumentation)
- Visualizations
  - Simple text-based
  - Simple GUI (shows same info as text-based)

Dynaprof – output
[leko@eta-1 dynaprof]$ wallclockrpt lu1.wallclock.16143

Exclusive Profile.
Name              Percent        Total    Calls
-------------     -------        -----    -----
TOTAL                 100    1.436e+11        1
unknown               100    1.436e+11        1
main            3.837e-06         5511        1

Inclusive Profile.
Name              Percent        Total SubCalls
-------------     -------        ----- --------
TOTAL                 100    1.436e+11        0
main                  100    1.436e+11        5

1-Level Inclusive Call Tree.
Parent/-Child     Percent        Total    Calls
-------------     -------        -----    -----
TOTAL                 100    1.436e+11        1
main                  100    1.436e+11        1
- f_setarg.0    1.414e-05     2.03e+04        1
- f_setsig.1    1.324e-05    1.902e+04        1
- f_init.2      2.569e-05    3.691e+04        1
- atexit.3      7.042e-06    1.012e+04        1
- MAIN__.4              0            0        1

KOJAK

- Metrics collected
  - MPI: message start time, receive time, size, message sender/recipient
  - Manual instrumentation: start and stop times
  - 1 PAPI metric / run (only FLOPS and L1 data misses visualized)
- Profiled entities
  - MPI calls (MPI wrapper library)
  - Function calls (automatic instrumentation, only available on a few platforms)
  - Regions and function calls via manual instrumentation
- Visualizations
  - Can export traces to Vampir trace format (see ICT)
  - Shows profile and analyzed data via CUBE (described on next few slides)

CUBE overview: simple description

- Uses a 3-pane approach to display information
  - Metric pane
  - Module/calltree pane
  - Location pane (system tree)
- Each item is displayed along with a color to indicate severity of condition
- Severity can be expressed 4 ways
  - Absolute (time)
  - Percentage
  - Relative percentage (changes module & location pane)
  - Comparative percentage (differences between executions)
- Right-clicking brings up source code location
- Despite documentation, interface is actually quite intuitive

CUBE example: CAMEL
After opening the .cube file (default metric shown = absolute time taken in seconds)

CUBE example: CAMEL
After expanding all 3 root nodes; color shown indicates metric “severity” (amount of time)

CUBE example: CAMEL
Selecting “Execution” shows execution time, broken down into part of code & machine

CUBE example: CAMEL
Selecting mainloop adjusts system tree to show only the time spent in mainloop per processor

CUBE example: CAMEL
Expanded nodes show exclusive metric (only time spent by node)

CUBE example: CAMEL
Collapsed nodes show inclusive metric (time spent by node and all children nodes)

CUBE example: CAMEL
Metric pane also shows detected bottlenecks; here, a “Late Sender” in MPI_Recv within main, spread across all nodes

Intel Cluster Tools (ICT)

- Metrics collected
  - MPI functions: start time, end time, message size, message sender/recipient
  - User-defined events: counter, start & end times
  - Code location for source-code correlation
- Instrumented entities
  - MPI functions via wrapper library
  - User functions via binary instrumentation(?)
  - User functions & regions via manual instrumentation
- Visualizations
  - Different types: timelines, statistics & counter info
  - Described in next slides

ICT visualizations – timelines & summaries

- Summary Chart Display
  - Allows the user to see how much work is spent in MPI calls
  - Fig. 1 Summary Chart
- Timeline Display
  - Zoomable, scrollable timeline representation of program execution
  - Fig. 2 Timeline Display

ICT visualizations – histogram & counters

- Summary Timeline
  - Timeline/histogram representation showing the number of processes in each activity per time bin
  - Fig. 3 Summary Timeline
- Counter Timeline
  - Value-over-time representation (behavior depends on counter definition in trace)
  - Fig. 4 Counter Timeline

ICT visualizations – message stats & process profiles

- Message Statistics Display
  - Message data to/from each process (count, length, rate, duration)
  - Fig. 5 Message Statistics
- Process Profile Display
  - Per-process data regarding activities
  - Fig. 6 Process Profile Display

ICT visualizations – general stats & call tree

- Statistics Display
  - Various statistics regarding activities in histogram, table, or text format
  - Fig. 7 Statistics Display
- Call Tree Display
  - Fig. 8 Call Tree Display

ICT visualizations – source & activity chart

- Source View
  - Source code correlation with events in Timeline
  - Fig. 9 Source View
- Activity Chart
  - Per-process histograms of Application and MPI activity
  - Fig. 10 Activity Chart

ICT visualizations – process timeline & activity chart

- Process Timeline
  - Activity timeline and counter timeline for a single process
  - Fig. 11 Process Timeline
- Process Activity Chart
  - Same type of information as Global Summary Chart
- Process Call Tree
  - Same type of information as Global Call Tree
  - Fig. 12 Process Activity Chart & Call Tree

Pablo

- Metrics collected
  - Time inclusive/exclusive of a function
  - Hardware counters via PAPI
  - Summary metrics computed from timing info
    - Min/max/avg/stdev/count
- Profiled entities
  - Functions, function calls, and outer loops
  - All selected via GUI
- Visualizations
  - Displays derived summary metrics color-coded and inline with source code
  - Shown on next slide

SvPablo

MPICL/ParaGraph

- Metrics collected
  - MPI functions: start time, end time, message size, message sender/recipient
  - Manual instrumentation: start time, end time, “work” done (up to user to pass this in)
- Profiled entities
  - MPI function calls via PMPI interface (see the wrapper sketch after this slide)
  - User functions/regions via manual instrumentation
- Visualizations
  - Many, separated into 4 categories: utilization, communication, task, “other”
  - Described in following slides

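The PMPI interface lets a tool interpose on MPI calls by defining the MPI_ symbol itself and forwarding to the PMPI_ entry point; this is the mechanism MPICL (and mpiP) rely on. A minimal sketch of one such wrapper, recording the fields listed above; record_event is a hypothetical logging helper, and the const-qualified buffer follows the MPI-3 style signature:

```c
#include <stdio.h>
#include <mpi.h>

/* Hypothetical logging helper: a real tool buffers trace records and
 * flushes them to a trace file at MPI_Finalize. */
static void record_event(const char *fn, double t0, double t1,
                         int bytes, int peer) {
    fprintf(stderr, "%s: %.6f..%.6f, %d bytes to rank %d\n",
            fn, t0, t1, bytes, peer);
}

/* Interposed MPI_Send: every application call lands here, and the real
 * work is done by the PMPI_ entry point. */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    int size;
    PMPI_Type_size(type, &size);   /* PMPI_ avoids re-entering wrappers */

    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    double t1 = MPI_Wtime();

    record_event("MPI_Send", t0, t1, count * size, dest);
    return rc;
}
```

Linked ahead of the MPI library, this wrapper captures start time, end time, message size, and recipient without touching application source.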
ParaGraph visualizations

- Utilization visualizations
  - Display rough estimate of processor utilization
  - Utilization broken down into 3 states:
    - Idle – when the program is blocked waiting for a communication operation (or has stopped execution)
    - Overhead – when the program is performing communication but is not blocked (time spent within MPI library)
    - Busy – when executing any part of the program other than communication
  - “Busy” doesn’t necessarily mean useful work is being done, since it assumes (not communication) := busy (see the sketch after this slide)
- Communication visualizations
  - Display different aspects of communication
  - Frequency, volume, overall pattern, etc.
  - “Distance” computed by setting topology in options menu
- Task visualizations
  - Display information about when processors start & stop tasks
  - Requires manually instrumented code to identify when processors start/stop tasks
- Other visualizations
  - Miscellaneous things

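That three-state split can be stated in a few lines. A sketch of the classification; the proc_status fields are a simplification of what the MPICL trace actually records:

```c
typedef enum { BUSY, OVERHEAD, IDLE } util_state;

/* Simplified per-process trace cursor: what the process is doing at the
 * current time, judged only from communication entry/exit events. */
typedef struct {
    int in_comm;   /* inside an MPI/communication routine */
    int blocked;   /* that routine is waiting on a peer    */
} proc_status;

static util_state classify(const proc_status *p) {
    if (!p->in_comm) return BUSY;   /* anything non-communication = busy */
    return p->blocked ? IDLE : OVERHEAD;
}
```
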
Utilization visualizations – utilization count

- Displays # of processors in each state at a given moment in time
- Busy shown on bottom, overhead in middle, idle on top

Utilization visualizations – Gantt chart

- Displays utilization state of each processor as a function of time

Utilization visualizations – Kiviat diagram

- Shows our friend, the Kiviat diagram
- Each spoke is a single processor
- Dark green shows moving average, light green shows current high watermark
  - Timing parameters for each can be adjusted
- Metric shown can be “busy” or “busy + overhead”

Utilization visualizations – streak

- Shows “streak” of state
  - Similar to winning/losing streaks of baseball teams
  - Win = overhead or busy
  - Loss = idle
- Not sure how useful this is

Utilization visualizations – utilization summary

- Shows percentage of time spent in each utilization state up to current time

Utilization visualizations – utilization meter

- Shows percentage of processors in each utilization state at current time

Utilization visualizations – concurrency profile

- Shows histograms of # processors in a particular utilization state
- Ex: diagram shows
  - Only 1 processor was busy ~5% of the time
  - All 8 processors were busy ~90% of the time

Communication visualizations – color code

- Color code controls colors used on most communication visualizations
- Can have color indicate message sizes, message distance, or message tag
  - Distance computed by topology set in options menu

Communication visualizations – communication traffic

- Shows overall traffic at a given time
  - Bandwidth used, or
  - Number of messages in flight
- Can show single node or aggregate of all nodes

Communication visualizations – spacetime diagram

- Shows standard space-time diagram for communication
  - Messages sent from node to node at which times

Communication visualizations – message queues

- Shows data about message queue lengths
  - Incoming/outgoing
  - Number of bytes queued/number of messages queued
- Colors mean different things
  - Dark color shows current moving average
  - Light color shows high watermark

Communication visualizations – communication matrix

- Shows which processors sent data to which other processors

Communication visualizations – communication meter

- Shows percentage of communication used at the current time
  - Message count or bandwidth
  - 100% = max # of messages / max bandwidth used by the application at a specific time

Communication visualizations – animation

- Animates messages as they occur in trace file
- Can overlay messages over topology
- Available topologies
  - Mesh
  - Ring
  - Hypercube
  - User-specified
    - Can lay out each node as you want
    - Can store to a file and load later on

Communication visualizations – node data

- Shows detailed communication data
- Can display metrics
  - Which node
  - Message tag
  - Message distance
  - Message length
- For a single node, or aggregate for all nodes

Task visualizations – task count

- Shows number of processors that are executing a task at the current time
- At end of run, changes to show summary of all tasks

Task visualizations – task Gantt

- Shows Gantt chart of which task each processor was working on at a given time

Task visualizations – task speed

- Similar to Gantt chart, but displays “speed” of each task
- Must record work done by task in instrumentation call (not done for example shown above)

Task visualizations – task status

- Shows which tasks have started and finished at the current time

Task visualizations – task summary

- Shows % time spent on each task
- Also shows any overlap between tasks

Task visualizations – task surface

- Shows time spent on each task by each processor
- Useful for seeing load imbalance on a task-by-task basis

Task visualizations – task work

- Displays work done by each processor
- Shows rate and volume of work being done
- Example doesn’t show anything because no work amounts were recorded in the trace being visualized

Other visualizations – clock, coordinates

- Clock
  - Shows current time
- Coordinate information
  - Shows coordinates when you click on any visualization

Other visualizations – critical path

- Highlights critical path in space-time diagram in red
  - Longest serial path shown in red
  - Depends on point-to-point communication (collectives can screw it up); a sketch of the computation follows

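One common way to compute that path treats the trace as a DAG: each event depends on the previous event on its process and, for receives, on the matching send; the critical path is the longest-duration chain. A minimal sketch under the assumption that events are already indexed in a causally consistent order and send/receive pairs are matched (ParaGraph's actual algorithm may differ):

```c
#include <stdio.h>

/* One trace event per process step; dependencies are the previous event
 * on the same process and, for receives, the matching send. */
typedef struct {
    double duration;   /* time spent in this event                    */
    int    prev;       /* index of prior event on same process, or -1 */
    int    send;       /* for receives: index of matching send, or -1 */
    double cpl;        /* longest path ending at this event (output)  */
    int    via;        /* predecessor on that path (for backtracking) */
} event;

/* Events must be indexed so every dependency precedes its dependent
 * (true if sorted by timestamp in a consistent trace). */
static int critical_path_end(event *ev, int n) {
    int end = 0;
    for (int i = 0; i < n; i++) {
        double best = 0.0; int via = -1;
        if (ev[i].prev >= 0 && ev[ev[i].prev].cpl > best)
            { best = ev[ev[i].prev].cpl; via = ev[i].prev; }
        if (ev[i].send >= 0 && ev[ev[i].send].cpl > best)
            { best = ev[ev[i].send].cpl; via = ev[i].send; }
        ev[i].cpl = best + ev[i].duration;
        ev[i].via = via;
        if (ev[i].cpl > ev[end].cpl) end = i;
    }
    return end;   /* follow .via links backward to recover the red path */
}

int main(void) {
    /* proc 0: e0 -> e1 (send); proc 1: e2 -> e3 (recv matching e1) */
    event ev[4] = {
        {2.0, -1, -1, 0, 0},
        {1.0,  0, -1, 0, 0},
        {0.5, -1, -1, 0, 0},
        {1.0,  2,  1, 0, 0},
    };
    int end = critical_path_end(ev, 4);
    printf("critical path length %.1f ending at event %d\n", ev[end].cpl, end);
    return 0;   /* prints: critical path length 4.0 ending at event 3 */
}
```

Collectives break the simple send/receive matching above, which is the caveat noted in the bullet.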
Other visualizations – phase portrait

- Shows relationship between processor utilization and communication usage

Other visualizations – statistics

- Gives overall statistics for run
- Data
  - % busy, overhead, idle time
  - Total count and bandwidth of messages
  - Max, min, average
    - Message size
    - Distance
    - Transit time
- Shows max of 16 processors at a time

Other visualizations – processor status

- Shows
  - Processor status
  - Which task each processor is executing
  - Communication (sends & receives)
- Each processor is a square in the grid (8-processor example shown)

Other visualizations – trace events

- Shows text output of all trace file events