Program Development Tools: SGI Performance Tools

Timing Mechanisms
• External timers provided by IRIX
– timex
– time
– ssusage
• Internal timers
The Event Counters
• The MIPS R1x000 CPU has on-chip event counters that monitor processor activity
– the R10000 processor has 2 counters
– each counter has a selector for the event to be counted:
• counter 0: events 0 - 15
• counter 1: events 16 - 31
– see man r10k_counters(5) for a complete description
• The R1x000 counters are available in software through:
– the IRIX perfex command
– the profiling environment
– a procedural interface (Fortran and C)
The perfex Command
perfex [-e e0][-e e1][-a][-mp][-y][-s][-x] cmd [args]
where
– e0 and e1 are the event identifiers; they should belong to non-overlapping
groups (i.e. one from 0-15 and the other from 16-31)
– -a multiplex all events
– -mp report counts for each thread, as well as totals
– -y give timing estimates for each count and general statistics
– -s start/stop counting on the USR1/USR2 signal
– -x count events on exception as well
examples:
perfex -e 2 -e 25 ./swp
………………. output from swp execution
Issued loads ……………………………………………………. 13847619
Primary data cache misses …………………. 2310349
perfex -e 3 -e 26 ./swp
………………. output from swp execution
Issued stores ……………………………………………………. 9176152
Secondary data cache misses ………………. 9029
Interpretation of event counters requires a model!
perfex -t: the Model Cost Table
Costs for IP27 processor MIPS R12000 CPU
                                                                       Cost
Event Counter Name                                           Typical     Minimum     Maximum
============================================================================================
 0 Cycles................................................  1.00 clks   1.00 clks   1.00 clks
 1 Decoded instructions..................................  0.00 clks   0.00 clks   1.00 clks
 2 Decoded loads.........................................  1.00 clks   1.00 clks   1.00 clks
 3 Decoded stores........................................  1.00 clks   1.00 clks   1.00 clks
 4 Miss handling table occupancy.........................  0.00 clks   0.00 clks   0.00 clks
 5 Failed store conditionals.............................  1.00 clks   1.00 clks   1.00 clks
 6 Resolved conditional branches.........................  1.00 clks   1.00 clks   1.00 clks
 7 Quadwords written back from scache....................  8.49 clks   5.90 clks   8.77 clks
 8 Correctable scache data array ECC errors..............  0.00 clks   0.00 clks   1.00 clks
 9 Primary instruction cache misses...................... 17.01 clks   4.34 clks  17.01 clks
10 Secondary instruction cache misses.................... 99.89 clks  63.03 clks  99.89 clks
11 Instr misprediction from scache way prediction........  0.00 clks   0.00 clks   1.00 clks
12 External interventions................................  0.00 clks   0.00 clks   0.00 clks
13 External invalidations................................  0.00 clks   0.00 clks   0.00 clks
14 ALU/FPU progress cycles...............................  1.00 clks   1.00 clks   1.00 clks
15 Graduated instructions................................  0.00 clks   0.00 clks   1.00 clks
16 Executed prefetch instructions........................  0.00 clks   0.00 clks   0.00 clks
17 Prefetch primary data cache misses....................  0.00 clks   0.00 clks   1.00 clks
18 Graduated loads.......................................  1.00 clks   1.00 clks   1.00 clks
19 Graduated stores......................................  1.00 clks   1.00 clks   1.00 clks
20 Graduated store conditionals..........................  1.00 clks   1.00 clks   1.00 clks
21 Graduated floating point instructions.................  1.00 clks   0.50 clks  52.00 clks
22 Quadwords written back from primary data cache........  3.98 clks   3.14 clks   3.98 clks
23 TLB misses............................................ 77.78 clks  77.78 clks  77.78 clks
24 Mispredicted branches.................................  7.28 clks   6.00 clks   8.81 clks
25 Primary data cache misses.............................  8.50 clks   2.17 clks   8.50 clks
26 Secondary data cache misses........................... 99.89 clks  63.03 clks  99.89 clks
27 Data misprediction from scache way prediction table...  0.00 clks   0.00 clks   1.00 clks
28 State of intervention hits in scache..................  0.00 clks   0.00 clks   0.00 clks
29 State of invalidation hits in scache..................  0.00 clks   0.00 clks   0.00 clks
30 Store/prefetch exclusive to clean block in scache.....  1.00 clks   1.00 clks   1.00 clks
31 Store/prefetch exclusive to shared block in scache....  1.00 clks   1.00 clks   1.00 clks
perfex -a -y: Multiplexing Events
Based on 300 MHz IP27 MIPS R12000 CPU
                                                                              Typical       Minimum       Maximum
Event Counter Name                                         Counter Value   Time (sec)    Time (sec)    Time (sec)
=================================================================================================================
 0 Cycles................................................  5019528970688  16731.763236  16731.763236  16731.763236
16 Executed prefetch instructions........................     8606259264      0.000000      0.000000      0.000000
 4 Miss handling table occupancy.........................  5443332412048  18144.441373  18144.441373  18144.441373
18 Graduated loads.......................................   938555860624   3128.519535   3128.519535   3128.519535
 2 Decoded loads.........................................   935411271824   3118.037573   3118.037573   3118.037573
25 Primary data cache misses.............................    83185778832   2356.930400    601.710467   2356.930400
21 Graduated floating point instructions.................   519085469744   1730.284899    865.142450  89974.814756
 3 Decoded stores........................................   413277338336   1377.591128   1377.591128   1377.591128
19 Graduated stores......................................   411214746144   1370.715820   1370.715820   1370.715820
22 Quadwords written back from primary data cache........    63439898464    841.635986    664.004271    841.635986
 6 Resolved conditional branches.........................   231879145424    772.930485    772.930485    772.930485
26 Secondary data cache misses...........................     1951064336    649.639388    409.918617    649.639388
 7 Quadwords written back from scache....................    13451750864    380.684549    264.551100    393.239517
23 TLB misses............................................     1238242032    321.034884    321.034884    321.034884
 9 Primary instruction cache misses......................     1719354640     97.487408     24.873330     97.487408
24 Mispredicted branches.................................     3744555216     90.867873     74.891104    109.965105
10 Secondary instruction cache misses....................       26286992      8.752692      5.522897      8.752692
31 Store/prefetch exclusive to shared block in scache....     1435428608      4.784762      4.784762      4.784762
20 Graduated store conditionals..........................      639716576      2.132389      2.132389      2.132389
 5 Failed store conditionals.............................       54079168      0.180264      0.180264      0.180264
30 Store/prefetch exclusive to clean block in scache.....       20579264      0.068598      0.068598      0.068598
 1 Decoded instructions..................................  3312748123920      0.000000      0.000000  11042.493746
 8 Correctable scache data array ECC errors..............              0      0.000000      0.000000      0.000000
11 Instruction mispred from scache way prediction table..      166448576      0.000000      0.000000      0.554829
12 External interventions................................     1584014720      0.000000      0.000000      0.000000
13 External invalidations................................     3304023728      0.000000      0.000000      0.000000
14 ALU/FPU progress cycles...............................              0      0.000000      0.000000      0.000000
15 Graduated instructions................................  3268746182224      0.000000      0.000000  10895.820607
…
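The time columns in the perfex -a -y listing are consistent with a simple cost model: estimated time = counter value × cost per event (from the perfex -t table) ÷ clock frequency. The Python sketch below reproduces a few of the listed times under that assumption; the function name is illustrative and not part of perfex.

```python
# Sketch of the cost model: seconds = count * cost_in_clocks / clock_hz.
# estimated_seconds is a hypothetical helper name; all numbers are taken
# from the perfex -t and perfex -a -y listings.

CLOCK_HZ = 300e6  # 300 MHz R12000, as stated in the listing header


def estimated_seconds(count, cost_clks, clock_hz=CLOCK_HZ):
    """Convert an event count into seconds using a per-event cost in clocks."""
    return count * cost_clks / clock_hz


l1_misses = 83185778832  # event 25, primary data cache misses
print(f"L1 misses, typical (8.50 clks): {estimated_seconds(l1_misses, 8.50):.6f} s")
print(f"L1 misses, minimum (2.17 clks): {estimated_seconds(l1_misses, 2.17):.6f} s")

tlb_misses = 1238242032  # event 23, 77.78 clks each
print(f"TLB misses:                     {estimated_seconds(tlb_misses, 77.78):.6f} s")
```

Events whose typical cost is 0.00 clks (for example decoded or graduated instructions) contribute nothing to the typical time, which is why their time columns are zero even though the counts are large.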
perfex -y Statistics (R12000)
Statistics
=========================================================================================
Graduated instructions/cycle................................................ 0.651206
Graduated floating point instructions/cycle................................. 0.103413
Graduated loads & stores/cycle.............................................. 0.268904
Graduated loads & stores/floating point instruction......................... 2.600286
Mispredicted branches/Resolved conditional branches......................... 0.016149
Graduated loads /Decoded loads ( and prefetches )........................... 0.994214
Graduated stores/Decoded stores............................................. 0.995009
Data mispredict/Data scache hits............................................ 0.036721
Instruction mispredict/Instruction scache hits.............................. 0.098312
L1 Cache Line Reuse......................................................... 15.225978
L2 Cache Line Reuse......................................................... 41.636102
L1 Data Cache Hit Rate...................................................... 0.938370
L2 Data Cache Hit Rate...................................................... 0.976546
Time accessing memory/Total time............................................ 0.467783
L1--L2 bandwidth used (MB/s, average per process)........................... 219.760658
Memory bandwidth used (MB/s, average per process)........................... 27.789316
MFLOPS (average per process)................................................ 31.023955
Cache misses in flight per cycle (average).................................. 1.084431
Prefetch cache miss rate.................................................... 0.606133
Explanation of Statistics/1
• L1 cache line reuse: should be large (>>1)
– the average number of times a primary data cache line is used
after it has been moved into the cache
L1 cache line reuse = (loads + stores - L1 cache misses) / L1 cache misses
• L2 cache line reuse: should be large (>>1)
– the average number of times a secondary cache line is used after it has been moved into
the cache
L2 cache line reuse = (L1 cache misses - L2 cache misses) / L2 cache misses
• L1 data cache hit rate: should be ~1
– the fraction of data accesses that find the data already resident in L1
L1 data cache hit rate = 1 - L1 cache misses / (loads + stores)
• L2 data cache hit rate: should be ~1
– the fraction of L1 misses that find the data already resident in L2
L2 data cache hit rate = 1 - L2 cache misses / L1 cache misses
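As a worked check of these definitions, the following Python sketch recomputes the four cache statistics from the raw counter values in the perfex -a -y listing. The helper names are illustrative; only the formulas and the counts come from the slides.

```python
# Sketch: cache reuse and hit rate from raw event counts.
# line_reuse and hit_rate are hypothetical helper names.


def line_reuse(accesses, misses):
    """Average number of further uses of a cache line after it was loaded."""
    return (accesses - misses) / misses


def hit_rate(accesses, misses):
    """Fraction of accesses that find their data in the cache."""
    return 1.0 - misses / accesses


loads = 938555860624     # event 18, graduated loads
stores = 411214746144    # event 19, graduated stores
l1_misses = 83185778832  # event 25, primary data cache misses
l2_misses = 1951064336   # event 26, secondary data cache misses

# L1 sees every load and store; L2 sees only the accesses that missed in L1.
print(f"L1 cache line reuse    {line_reuse(loads + stores, l1_misses):.6f}")
print(f"L2 cache line reuse    {line_reuse(l1_misses, l2_misses):.6f}")
print(f"L1 data cache hit rate {hit_rate(loads + stores, l1_misses):.6f}")
print(f"L2 data cache hit rate {hit_rate(l1_misses, l2_misses):.6f}")
```

Reuse and hit rate carry the same information: reuse = hit rate / (1 - hit rate), so a hit rate of 0.94 already corresponds to only about 15 uses per line.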
Explanation of Statistics/2
• Instruction or data mispredictions/hits: should be small (~0)
• L1-L2 bandwidth used ([MB/s], average per process)
– total number of bytes moved between L1 and L2 divided by total run time
#bytes moved = L1 cache misses * L1 cache line size + quadwords written back from L1 * 16 bytes
• Memory bandwidth used ([MB/s], average per process)
– total number of bytes moved between L2 and memory divided by total run time
#bytes moved = L2 cache misses * L2 cache line size + quadwords written back from L2 * 16 bytes
• Performance in Mflop/s: larger is better
– computed as the number of floating point instructions divided by total run time
– the madd instruction counts as 1 instruction although it performs 2 floating point
operations, so this statistic usually underestimates the performance in Mflop/s
• Time accessing memory/Total time: should be small (~0)
Time accessing memory/Total time = T(loads + stores + L1 misses + L2 misses + TLB misses) / total run time
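The bandwidth, Mflop/s and memory-time statistics can be reproduced from the perfex -a -y listing with the sketch below. It is an illustration only: the helper names are made up, and the cache line sizes (32 bytes for L1, 128 bytes for L2) are assumptions that are not stated on the slides; they were chosen because they reproduce the reported values.

```python
# Sketch: bandwidth, MFLOPS and memory-time fraction from event counts.
# Helper names are hypothetical. Line sizes are ASSUMPTIONS (not in the slides).

CLOCK_HZ = 300e6      # 300 MHz R12000, from the listing header
L1_LINE_BYTES = 32    # assumption: primary data cache line size
L2_LINE_BYTES = 128   # assumption: secondary cache line size
QUADWORD_BYTES = 16   # from the slides


def bandwidth_mb_s(misses, line_bytes, quadwords_written_back, seconds):
    """Bytes moved (line fills plus write-backs) per second, in MB/s."""
    bytes_moved = misses * line_bytes + quadwords_written_back * QUADWORD_BYTES
    return bytes_moved / seconds / 1e6


def mflops(fp_instructions, seconds):
    """Graduated floating point instructions per second, in millions."""
    return fp_instructions / seconds / 1e6


def memory_time_fraction(memory_times, total_seconds):
    """Share of the run time attributed to memory-related events."""
    return sum(memory_times) / total_seconds


run_time = 5019528970688 / CLOCK_HZ  # event 0 (cycles) converted to seconds

l1_l2 = bandwidth_mb_s(83185778832, L1_LINE_BYTES, 63439898464, run_time)
memory = bandwidth_mb_s(1951064336, L2_LINE_BYTES, 13451750864, run_time)
flops = mflops(519085469744, run_time)

# Typical times of loads, stores, L1 misses, L2 misses and TLB misses.
typical_times = [3128.519535, 1370.715820, 2356.930400, 649.639388, 321.034884]
fraction = memory_time_fraction(typical_times, run_time)

print(f"L1--L2 bandwidth {l1_l2:.6f} MB/s")
print(f"Memory bandwidth {memory:.6f} MB/s")
print(f"MFLOPS           {flops:.6f}")
print(f"Memory time      {fraction:.6f}")
```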
Explanation of Statistics/3
The R12000 processor gives additional information:
• Miss handling table (MHT) occupancy:
– measures the number of entries in the MHT; a higher value indicates more
outstanding cache misses. A useful derived quantity is the average occupancy per cycle:
Mh = MHT occupancy / cycles, in the range 0 .. 5
– Mh should be >1; a value close to 1 suggests a dependency in the data
that prevents the processor from issuing multiple outstanding cache misses
– Mh can be <1 for code that fits into the cache
– a high Mh combined with low memory bandwidth indicates many
L1 misses to the same L2 cache line
• Prefetch miss rate:
– the fraction of prefetch instructions that miss in the first-level cache
Rp = prefetch primary data cache misses (17) / executed prefetch instructions (16), in the range 0 .. 1
– Rp should be close to 1:
to be effective, every prefetch should miss in the cache.
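A short Python sketch of the two R12000 ratios follows. The function names are illustrative. The Mh example uses the counts from the perfex -a -y listing; the count for event 17 does not appear in that listing, so the prefetch example uses made-up placeholder numbers.

```python
# Sketch: the two R12000-specific ratios. Function names are hypothetical.


def mht_occupancy_per_cycle(mht_occupancy, cycles):
    """Mh: average number of cache misses in flight per cycle."""
    return mht_occupancy / cycles


def prefetch_miss_rate(prefetch_l1_misses, prefetch_instructions):
    """Rp: fraction of prefetches that missed in L1 (0.0 if none were executed)."""
    if prefetch_instructions == 0:
        return 0.0
    return prefetch_l1_misses / prefetch_instructions


# Event 4 (MHT occupancy) and event 0 (cycles) from the perfex -a -y listing.
mh = mht_occupancy_per_cycle(5443332412048, 5019528970688)
print(f"Mh = {mh:.6f}")

# Placeholder counts, for illustration only (not measured values).
rp = prefetch_miss_rate(prefetch_l1_misses=600, prefetch_instructions=1000)
print(f"Rp = {rp:.2f}")
```

The computed Mh matches the "Cache misses in flight per cycle" line of the perfex -y statistics.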
R1x000 Important Events
Event  Description                                      Processor
 0     Cycles                                           R10000, R12000
 4     Miss handling table occupancy                    R12000 only
 7     Quadwords written back from L2                   R10000, R12000
 9     L1 instruction cache misses                      R10000, R12000
10     L2 instruction cache misses                      R10000, R12000
12     External interventions                           R10000, R12000
13     External invalidations                           R10000, R12000
14     ALU/FPU forward progress cycles                  R10000, R12000
15     Graduated instructions                           R10000, R12000
16     Prefetch instructions                            R12000 only
18     Graduated loads                                  R10000, R12000
19     Graduated stores                                 R10000, R12000
21     Graduated fp instructions                        R10000, R12000
22     Quadwords written back from L1                   R10000, R12000
23     TLB misses                                       R10000, R12000
25     L1 data cache misses                             R10000, R12000
26     L2 data cache misses                             R10000, R12000
28     External intervention hits in L2                 R10000, R12000
29     External invalidation hits in L2                 R10000, R12000
30     Store/prefetch exclusive to clean block in L2    R10000, R12000
31     Store/prefetch exclusive to shared block in L2   R10000, R12000
Cache Coherency Counters
• The directory cache coherency protocol sends signals to nodes that
requested data from memory in the past
• Invalidations are transactions that convert cache lines from the
“shared” to the “invalid” state
• Interventions are transactions that convert a cache line from the
“exclusive” to the “shared” or “invalid” state (depending on whether the requestor
asked for “shared” or “exclusive” access)
• Counters 12 and 13 tell how many interventions and invalidations
have been received by the processor; counters 28 and 29 tell how
many of them actually hit in the cache
• Invalidations and/or interventions are sent to a node because the
cache line owner thinks that node has the cache line. However, if
that processor has dropped the cache line earlier, the signal will not
find the data in the cache and counters 12 or 13 will be updated,
while 28 or 29 will not.
Explanation: Useful Counters
• Cycles are proportional to the time spent on the CPU
– spin times are included, but wait times are not
• Graduated instructions divided by cycles gives the IPC (instructions per cycle)
of the program. The R1x000 processor can process
up to 4 instructions per cycle; an IPC >1.5 should be considered good.
• ALU/FPU progress cycles count the cycles in which the processor
does actual computation
• TLB misses count how many times a virtual address
translation was not found in the Translation Lookaside Buffer. If they account for a large share of the run time, larger pages
might be used in the application
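As an illustration of the IPC rule of thumb, the sketch below computes IPC from the graduated instruction and cycle counts of the perfex -a -y listing shown earlier. The function name and the wording of the verdict are illustrative; the thresholds (peak of 4, "good" above 1.5) are the ones quoted above.

```python
# Sketch: IPC from two event counts. ipc is a hypothetical helper name.

PEAK_IPC = 4.0  # the R1x000 can process up to 4 instructions per cycle
GOOD_IPC = 1.5  # rule-of-thumb threshold from the slide


def ipc(graduated_instructions, cycles):
    """Instructions per cycle: event 15 divided by event 0."""
    return graduated_instructions / cycles


value = ipc(3268746182224, 5019528970688)
verdict = "good" if value > GOOD_IPC else "room for improvement"
print(f"IPC = {value:.6f} ({100 * value / PEAK_IPC:.0f}% of peak, {verdict})")
```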
Event Counters
Procedural Interface
C synopsis
int start_counters(int e0, int e1);
int read_counters( int e0, long long *c0,
int e1, long long *c1);
int print_counters(int e0, long long c0,
int e1, long long c1);
Fortran synopsis
INTEGER*8 C0, C1
INTEGER*4 E0, E1
INTEGER*4 FUNCTION START_COUNTERS(E0, E1)
INTEGER*4 FUNCTION READ_COUNTERS(E0, C0, E1, C1)
INTEGER*4 FUNCTION PRINT_COUNTERS(E0, C0, E1, C1)
• The event specifiers e0 and e1 can also be set (or overridden) at run
time by the environment variables
– T5_EVENT0 and T5_EVENT1
• Link with -lperfex
• The counter values c0 and c1 are 64-bit integers
perfex Measurements of matmul
The C matrix multiply program has been compiled with
-mips4 -O3 -OPT:alias=restrict -mp
To measure memory-resident performance, use large pages (these settings are enabled when -mp is used):
setenv PAGESIZE_STACK 1024K
setenv PAGESIZE_DATA 1024K
Statistics                                       ============ residence ============
                                                          L1          L2         MEM
                                                      (N=20)     (N=500)    (N=2000)
Graduated instructions/cycle....................        1.85        1.82        1.71
Graduated floating point instructions/cycle.....        0.80        0.94        0.88
Graduated loads & stores/cycle..................        0.64        0.54        0.50
Graduated loads & stores/floating point instr...        0.81        0.57        0.57
L1 Cache Line Reuse.............................   205758.87        4.91         4.1
L2 Cache Line Reuse.............................       64.92    13343.20      202.52
L1 Data Cache Hit Rate..........................        0.99        0.83        0.80
L2 Data Cache Hit Rate..........................        0.98        0.99        0.99
Time accessing memory/Total time................        0.64        1.32         1.4
L1--L2 bandwidth used (MB/s, average per proc)..        0.05      907.08       977.7
Memory bandwidth used (MB/s, average per proc)..        0.05        0.32        30.4
MFLOPS (average per process)....................      239.15      281.70       266.0
Cache misses in flight per cycle (average)......        0.00        1.05        1.34
The SpeedShop Profiler
• No recompilation or relinking is necessary
• Multiple performance experiments and views
• The following performance experiments are supported:
– Program Counter (PC) sampling: -pcsamp
• to get exclusive time per subroutine and per source line
• no call stack information
– User Time: -usertime
• to get exclusive & inclusive times per subroutine (call stack)
• no source line information
– Hardware counter profiling: -“event”_hwc
– Communication overhead of MPI: -mpi
– Floating point exception tracing: -fpe
– Tracing of I/O system calls: -io
– malloc/free tracing: -heap
– Basic-block counting experiment: -ideal
see man speedshop for an explanation of the experiments
• Usage:
– ssrun -exp [options] program [args]
– prof “ssrun output file”
SpeedShop Experiments
[Figure: the program's instruction addresses (main at 0x000, sub1, sub2, sub3, ..., subn, up to 0xf30) are mapped onto a histogram indexed by address; each histogram entry is a 16-bit or 32-bit number.]
Time interrupt:
– 10 ms for normal experiments (e.g. -pcsamp)
– 1 ms for “fast” experiments (e.g. -fpcsamp)
PC Sampling Experiment:
– record the current instruction address (program counter) in a histogram (overhead <1%)
– 16-bit wide bins (e.g. -pcsamp) or 32-bit wide bins (e.g. -pcsampx)
Usertime Sampling Experiment:
– record all addresses on the call stack (overhead <5%)
Event interrupt:
– usage counts
– event-dependent thresholds
– “fast” experiments exist
ssrun: Experiment Recording
ssrun writes out performance measurements into a binary file which
is named after the program:
ssrun -expname options progname prog-arg
here the output will be binary files with the name
progname.expname.ID#pid
where ID is a letter code as follows:
• m the master process created by ssrun
• p slave processes in parallel execution (call to sproc())
• f slave processes created by a call to fork()
• e process created by a call to exec()
• s a system call process
examples:
> mpirun -np 2 ssrun -mpi cgmd
-rw-r--r-- 1 zacharov user  237672 Sep 10 13:50 cgmd.mpi.f318839
-rw-r--r-- 1 zacharov user 4733496 Sep 10 13:50 cgmd.mpi.f318854
-rw-r--r-- 1 zacharov user   11896 Sep 10 13:50 cgmd.mpi.m318787
> ssrun -usertime cgmd
-rw-r--r-- 1 zacharov user 1613562 Sep 10 14:28 cgmd.usertime.m320140
-rw-r--r-- 1 zacharov user 1601776 Sep 10 14:28 cgmd.usertime.p320140
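When many experiment files accumulate, the naming scheme can be decoded mechanically. The Python sketch below splits a file name into program, experiment, ID letter and PID. It assumes the form seen in the listings above (letter immediately followed by the PID); the function name and the role descriptions are illustrative, not part of SpeedShop.

```python
# Sketch: decode ssrun output file names of the form progname.expname.<ID><pid>.
# parse_experiment_file is a hypothetical helper, not a SpeedShop tool.
import re

ROLES = {
    "m": "master process created by ssrun",
    "p": "slave process created by sproc()",
    "f": "slave process created by fork()",
    "e": "process created by exec()",
    "s": "system call process",
}

# The program name may itself contain dots (e.g. program.exe), so the
# experiment name and the ID/pid field are matched from the right.
_NAME = re.compile(r"^(?P<program>.+)\.(?P<experiment>[^.]+)\.(?P<id>[mpfes])(?P<pid>\d+)$")


def parse_experiment_file(name):
    """Return (program, experiment, id_letter, pid) for an ssrun file name."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not an ssrun experiment file name: {name!r}")
    return match["program"], match["experiment"], match["id"], int(match["pid"])


for filename in ("cgmd.mpi.f318839", "cgmd.usertime.p320140", "program.exe.fpcsampx.m1329"):
    program, experiment, letter, pid = parse_experiment_file(filename)
    print(f"{filename}: {program} / {experiment} / pid {pid} ({ROLES[letter]})")
```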
prof: Performance Interpretation
• prof is the command-line interpreter for ssrun experiments
• Feedback file generation for compiler and linker
• Use cvperf(1)(not prof) for heap and MPI trace analysis
prof options:
• -heavy report the performance per source code line;
order in descending performance measure
• -lines report the performance per source code line;
order in program sequence
• -usage summary of the system resources used by the program
• -gprof show inclusive timing of callers and callees of each
function
• -q n truncate after the first n routines or lines
• -q n% list only the routines that take more than n% of the total time
• -q ncum% truncate after n% of cumulative time
Warning: the executable should not be stripped
PC Sampling Experiment/1
% ssrun -fpcsampx ./program.exe
% prof -lines program.exe.fpcsampx.m1329
------------------------------------------------------------------------
SpeedShop profile listing generated Tue Feb 10 17:53:44 1998
prof -lines program.exe.fpcsampx.m1329
program.exe (n32): Target program
fpcsampx: Experiment name
pc,4,1000,0:cu: Marching orders
R10000 / R10010: CPU / FPU
4: Number of CPUs
195: Clock frequency (MHz.)
Experiment notes--
From file program.exe.fpcsampx.m1329:
Caliper point 0 at target begin, PID 1329
/usr/users/Examples/speedshop/./program.exe
Caliper point 1 at exit(0)
------------------------------------------------------------------------
Summary of statistical PC sampling data (fpcsampx)--
26758: Total samples
26.758: Accumulated time (secs.)
1.0: Time per sample (msecs.)
4: Sample bin width (bytes)
------------------------------------------------------------------------
Function list, in descending order by time
------------------------------------------------------------------------
[index]    secs      %   cum.%  samples  function (dso: file, line)
[1]       9.180  34.3%   34.3%     9180  sub1 (prog.exe: program.f, 39)
[2]       7.070  26.4%   60.7%     7070  update_sum (prog.exe: program.f,62)
[3]       6.182  23.1%   83.8%     6182  sub3 (prog.exe: program.f, 112)
[4]       1.968   7.4%   91.2%     1968  irand_ (libftn.so: rand_.c, 62)
[5]       1.317   4.9%   96.1%     1317  rand_ (libftn.so: rand_.c, 67)
[6]       1.040   3.9%  100.0%     1040  ssdemo (prog.exe: program.f, 1)
          0.001   0.0%  100.0%        1  **OTHER** (excluded DSOs, rld,etc.)
         26.758 100.0%  100.0%    26758  TOTAL
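The columns of the function list follow directly from the sample counts: seconds = samples × time per sample, and the percentages are shares of the total sample count. A minimal Python sketch using the numbers from the listing above (the function name is illustrative):

```python
# Sketch: rebuild the secs / % / cum.% columns of a PC sampling profile.
# profile_rows is a hypothetical helper; the sample counts are from the listing.


def profile_rows(samples_by_function, ms_per_sample):
    """Return (name, secs, percent, cumulative percent), heaviest first."""
    total = sum(samples_by_function.values())
    ordered = sorted(samples_by_function.items(), key=lambda item: item[1], reverse=True)
    rows = []
    cumulative = 0
    for name, samples in ordered:
        cumulative += samples
        secs = samples * ms_per_sample / 1000.0
        rows.append((name, secs, 100.0 * samples / total, 100.0 * cumulative / total))
    return rows


samples = {
    "sub1": 9180,
    "update_sum": 7070,
    "sub3": 6182,
    "irand_": 1968,
    "rand_": 1317,
    "ssdemo": 1040,
    "**OTHER**": 1,
}

rows = profile_rows(samples, ms_per_sample=1.0)  # -fpcsampx samples every 1 ms
for name, secs, percent, cum_percent in rows:
    print(f"{secs:8.3f} {percent:6.1f}% {cum_percent:6.1f}%  {name}")
```

Because the data are statistical, a function with only a handful of samples (such as the single **OTHER** sample here) should not be over-interpreted.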
PC Sampling Experiment/2
------------------------------------------------------------------------
Line list, in descending order by function-time and then line number
------------------------------------------------------------------------
   secs      %   cum.%  samples  function (dso: file, line)
  0.002    0.0     0.0        2  sub1 (program.exe: program.f, 49)
  0.007    0.0     0.0        7  sub1 (program.exe: program.f, 50)
  0.930    3.5     3.5      930  sub1 (program.exe: program.f, 54)
  8.241   30.8    34.3     8241  sub1 (program.exe: program.f, 55)

  0.858    3.2    37.5      858  update_sum (program.exe: program.f, 72)
  1.263    4.7    42.2     1263  update_sum (program.exe: program.f, 73)
  4.949   18.5    60.7     4949  update_sum (program.exe: program.f, 74)

  0.001    0.0    60.7        1  sub3 (program.exe: program.f, 122)
  1.029    3.8    64.6     1029  sub3 (program.exe: program.f, 123)
  5.152   19.3    83.8     5152  sub3 (program.exe: program.f, 124)

  0.785    2.9    86.8      785  irand_ (libftn.so: rand_.c, 62)
  1.183    4.4    91.2     1183  irand_ (libftn.so: rand_.c, 63)

  0.076    0.3    91.5       76  rand_ (libftn.so: rand_.c, 67)
  1.241    4.6    96.1     1241  rand_ (libftn.so: rand_.c, 69)

  0.174    0.7    96.8      174  ssdemo (program.exe: program.f, 20)
  0.866    3.2   100.0      866  ssdemo (program.exe: program.f, 21)
ssrun: Performance Experiments
(hardware counter)
The R1x000 event counters can be used for performance
analysis. All experiments are described in man ssrun(1).
Useful experiments:
• -[f]dsc_hwc
– samples on overflows of the secondary data cache miss counter (counter 26)
– the address is recorded each time the counter reaches 131 counts (29 for “fast” experiments)
• -[f]gfp_hwc
– samples on overflows of the graduated floating point instruction counter (counter 21)
– the address is recorded each time the counter reaches 32771 counts (6553 for
“fast” experiments)
• -prof_hwctime
– profiles the counter specified by the environment variable
_SPEEDSHOP_HWC_COUNTER_PROF_NUMBER using call-stack sampling
– addresses are recorded each time the counter reaches the overflow value set in the environment variable
_SPEEDSHOP_HWC_COUNTER_OVERFLOW
• -prof_hwc
– same as -prof_hwctime, but uses PC sampling on event counter overflow
Hardware Counters Experiment/1
% ssrun -ftlb_hwc program.exe ; prof program.exe.ftlb_hwc.m1335
------------------------------------------------------------------------
SpeedShop profile listing generated Tue Feb 10 17:54:23 1998
prof -lines program.exe.ftlb_hwc.m1335
program.exe (n32): Target program
ftlb_hwc: Experiment name
hwc,23,53:cu: Marching orders
R10000 / R10010: CPU / FPU
4: Number of CPUs
195: Clock frequency (MHz.)
Experiment notes--
From file program.exe.ftlb_hwc.m1335:
Caliper point 0 at target begin, PID 1335
/usr/users/Examples/speedshop/./program.exe
Caliper point 1 at exit(0)
------------------------------------------------------------------------
Summary of R10K perf. counter overflow PC sampling data (ftlb_hwc)--
412712: Total samples
TLB misses (23): Counter name (number)
53: Counter overflow value
21873736: Total counts
------------------------------------------------------------------------
Function list, in descending order by counts
------------------------------------------------------------------------
[index]    counts      %   cum.%  samples  function (dso: file, line)
[1]      10224919  46.7%   46.7%   192923  sub1 (program.exe: program.f, 39)
[2]       5252035  24.0%   70.8%    99095  sub3 (program.exe: program.f, 112)
[3]       5111691  23.4%   94.1%    96447  update_sum (program.exe: program.f, 62)
[4]       1043729   4.8%   98.9%    19693  ssdemo (program.exe: program.f, 1)
[5]        241097   1.1%  100.0%     4549  rand_ (libftn.so: rand_.c, 67)
[6]            53   0.0%  100.0%        1  __checktraps (libc.so.1: stubfpeexit.c, 3)
[7]            53   0.0%  100.0%        1  exceed_length (libftn.so: wrtfmt.c, 1421)
[8]            53   0.0%  100.0%        1  wrt_I (libftn.so: wrtfmt.c, 102)
[9]            53   0.0%  100.0%        1  do_Lio_com (libftn.so: lio.c, 803)
               53   0.0%  100.0%        1  **OTHER** (includes excluded DSOs, rld, etc.)
         21873736 100.0%  100.0%   412712  TOTAL
Hardware Counters Experiment/2
Use the environment variables (see man speedshop):
% setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 6
% setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 100
% ssrun -prof_hwc ./program.exe
% prof -lines program.exe.prof_hwc.m1336
------------------------------------------------------------------------
SpeedShop profile listing generated Tue Feb 10 17:54:56 1998
prof -lines program.exe.prof_hwc.m1336
program.exe (n32): Target program
prof_hwc: Experiment name
hwc:cu: Marching orders
R10000 / R10010: CPU / FPU
4: Number of CPUs
195: Clock frequency (MHz.)
Experiment notes--
From file program.exe.prof_hwc.m1336:
Caliper point 0 at target begin, PID 1336
/usr/users/Examples/speedshop/./program.exe
Caliper point 1 at exit(0)
------------------------------------------------------------------------
Summary of R10K perf. counter overflow PC sampling data (prof_hwc)--
2153328: Total samples
Decoded branches (6): Counter name (number)
100: Counter overflow value
215332800: Total counts
etc…
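In both hardware counter listings, the "counts" values are simply the number of samples multiplied by the counter overflow value, since one sample is taken each time the counter overflows. A small Python sketch that checks this against the numbers in the listings (the function name is illustrative):

```python
# Sketch: event counts implied by an overflow-sampling profile.
# estimated_counts is a hypothetical helper; the numbers are from the listings.


def estimated_counts(samples, overflow_value):
    """Each sample stands for overflow_value occurrences of the event."""
    return samples * overflow_value


# -ftlb_hwc: TLB misses (counter 23), overflow value 53
print("total TLB misses:", estimated_counts(412712, 53))
print("sub1 TLB misses: ", estimated_counts(192923, 53))

# -prof_hwc with counter 6 and _SPEEDSHOP_HWC_COUNTER_OVERFLOW set to 100
print("decoded branches:", estimated_counts(2153328, 100))
```

A function with a single sample is therefore reported with exactly one overflow's worth of counts (53 in the TLB listing), which is an upper bound on resolution rather than a precise measurement.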
User Time Experiment
SpeedShop profile listing generated Tue Feb 10 17:51:59 1998
prof -gprof program.exe.usertime.m1332
program.exe (n32): Target program
usertime: Experiment name
ut:cu: Marching orders
R10000 / R10010: CPU / FPU
4: Number of CPUs
195: Clock frequency (MHz.)
Experiment notes--
From file program.exe.usertime.m1332:
Caliper point 0 at target begin, PID 1332
/usr/users/ruud/Examples/speedshop/./program.exe
Caliper point 1 at exit(0)
------------------------------------------------------------------------
Summary of statistical callstack sampling data (usertime)--
301: Total Samples
0: Samples with incomplete traceback
9.030: Accumulated Time (secs.)
30.0: Sample interval (msecs.)
------------------------------------------------------------------------
Function list, in descending order by exclusive time
------------------------------------------------------------------------
[index]  excl.secs  excl.%   cum.%  incl.secs  incl.%  samples  procedure (dso: file, line)
[4]          2.850   31.6%   31.6%      5.370   59.5%      179  sub1 (program.exe: program.f, 39)
[6]          2.520   27.9%   59.5%      2.520   27.9%       84  update_sum (program.exe: program.f, 62)
[5]          1.980   21.9%   81.4%      3.000   33.2%      100  sub3 (program.exe: program.f, 112)
[8]          0.660    7.3%   88.7%      0.660    7.3%       22  irand_ (libftn.so: rand_.c, 62)
[7]          0.480    5.3%   94.0%      1.140   12.6%       38  rand_ (libftn.so: rand_.c, 67)
[1]          0.390    4.3%   98.3%      9.030  100.0%      301  ssdemo (program.exe: program.f, 1)
[10]         0.090    1.0%   99.3%      0.090    1.0%        3  _write (libc.so.1: write.s, 15)
[16]         0.030    0.3%   99.7%      0.030    0.3%        1  s_wsle64 (libftn.so: lio.c, 132)
[17]         0.030    0.3%  100.0%      0.030    0.3%        1  s_stop (libftn.so: s_stop.c, 54)
[2]          0.000    0.0%  100.0%      9.030  100.0%      301  __start (program.exe: crt1text.s, 103)
[3]          0.000    0.0%  100.0%      9.030  100.0%      301  main (libftn.so: main.c, 76)
[9]          0.000    0.0%  100.0%      0.120    1.3%        4  sub2 (program.exe: program.f, 81)
[11]         0.000    0.0%  100.0%      0.090    1.0%        3  e_wsle (libftn.so: lio.c, 205)
User Time Experiment: -gprof
------------------------------------------------------------------------
Butterfly function list, in descending order by inclusive time
------------------------------------------------------------------------
          attrib.%  attrib.time  incl.time  caller (callsite) [index]
[index]   incl.%    incl.time    self%      self-time  procedure [index]
          attrib.%  attrib.time  incl.time  callee (callsite) [index]
------------------------------------------------------------------------
          100.0%    9.030        9.030      main (@0x0b35c004; libftn.so: main.c, 97) [3]
[1]       100.0%    9.030        4.3%       0.390      ssdemo [1]
          59.5%     5.370        5.370      sub1 (@0x10000ef0; prog.exe: program.f, 27) [4]
          33.2%     3.000        3.000      sub3 (@0x10000f30; program.exe: program.f,33) [5]
          1.3%      0.120        1.140      rand_ (@0x10000e20; program.exe: program.f, 21) [7]
          1.3%      0.120        0.120      sub2 (@0x10000f14; program.exe: program.f, 31) [9]
          0.3%      0.030        0.030      sstop (@0x10000f68; program.exe: program.f, 36)[17]
------------------------------------------------------------------------
[2]       100.0%    9.030        0.0%       0.000      __start [2]
          100.0%    9.030        9.030      main (@0x10000d68; program.exe: crt1text.s, 166)[3]
------------------------------------------------------------------------
          100.0%    9.030        9.030      start (@0x10000d68; program.exe: crt1text.s,166)[2]
[3]       100.0%    9.030        0.0%       0.000      main [3]
          100.0%    9.030        9.030      ssdemo (@0x0b35c004; libftn.so: main.c, 97) [1]
------------------------------------------------------------------------
          59.5%     5.370        9.030      ssdemo (@0x10000ef0; program.exe: program.f, 27)[1]
[4]       59.5%     5.370        31.6%      2.850      sub1 [4]
          27.9%     2.520        2.520      update_sum(@0x10001024;program.exe:program.f,51)[6]
------------------------------------------------------------------------
          33.2%     3.000        9.030      ssdemo (@0x10000f30; program.exe: program.f, 33)[1]
[5]       33.2%     3.000        21.9%      1.980      sub3 [5]
          11.3%     1.020        1.140      rand_ (@0x10001470; program.exe: program.f, 124)[7]
------------------------------------------------------------------------
          27.9%     2.520        5.370      sub1 (@0x10001024; program.exe: program.f, 51) [4]
[6]       27.9%     2.520        27.9%      2.520      update_sum [6]
------------------------------------------------------------------------
          1.3%      0.120        9.030      ssdemo (@0x10000e20; program.exe: program.f, 21)[1]
          11.3%     1.020        3.000      sub3 (@0x10001470; program.exe: program.f, 124) [5]
[7]       12.6%     1.140        5.3%       0.480      rand_ [7]
          7.3%      0.660        0.660      irand_ (@0x0b340f6c; libftn.so: rand_.c, 69) [8]
------------------------------------------------------------------------
          7.3%      0.660        1.140      rand_ (@0x0b340f6c; libftn.so: rand_.c, 69) [7]
[8]       7.3%      0.660        7.3%       0.660      irand_ [8]
------------------------------------------------------------------------
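The butterfly listing can be read as a small call graph: a procedure's inclusive time is its self time plus the time attributed to it from each callee, and every time is a whole number of 30 ms samples. The Python sketch below checks this for the entries shown above; the dictionary layout and function names are illustrative, and only procedures whose butterfly entries appear in the listing are included.

```python
# Sketch: inclusive time = self time + time attributed from callees.
# Data structures and helper names are hypothetical; values are from the listing.

SAMPLE_INTERVAL_S = 0.030  # usertime sample interval (30.0 msecs)

self_time = {
    "ssdemo": 0.390,
    "sub1": 2.850,
    "sub3": 1.980,
    "update_sum": 2.520,
    "rand_": 0.480,
    "irand_": 0.660,
}

# attrib.time of each callee, per caller, as listed in the butterfly entries.
callee_time = {
    "ssdemo": {"sub1": 5.370, "sub3": 3.000, "rand_": 0.120, "sub2": 0.120, "s_stop": 0.030},
    "sub1": {"update_sum": 2.520},
    "sub3": {"rand_": 1.020},
    "rand_": {"irand_": 0.660},
}


def inclusive_time(procedure):
    """Self time plus the time attributed to this procedure from its callees."""
    return self_time[procedure] + sum(callee_time.get(procedure, {}).values())


def samples_to_seconds(samples):
    """Convert a call-stack sample count into seconds."""
    return samples * SAMPLE_INTERVAL_S


for name in ("ssdemo", "sub1", "sub3", "rand_"):
    print(f"{name:8s} incl. {inclusive_time(name):.3f} s")
print(f"sub1 from samples: {samples_to_seconds(179):.3f} s")
```

Note that rand_ is reached from two callers (ssdemo and sub3), so its inclusive time of 1.140 s is split between them as 0.120 s and 1.020 s.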
SpeedShop Basic Tools
• ssrun
– basic performance data generation tool
– outputs binary file with the performance experiment results per program or execution
thread
• prof
– human readable interpretation of the ssrun binary files
• sscord/ssorder and sswsextr
– used to extract the working set information from a program and to re-order the program
subroutines so that frequently executed instructions are packed on the same memory pages (reduces
instruction TLB misses)
• ssusage
– usage: ssusage program prog-args
– similar to time, but also gives resident size and I/O characteristics
• squeeze and thrash
– change amount of memory available to the user program
• ssdump
– dump the binary performance data files in ASCII format
• fbdump
– writes compiler feedback files from prof
• ssaggregate
– combines multiple SpeedShop binary performance data files into one
SpeedShop Performance Views
• Function list, program graph, butterfly
• Source code with performance annotation
• Disassembly and basic blocks
• Architectural information
• Usage data (resource consumption on the machine)
• Heap analysis and memory leaks
• Working sets
• MP overhead view (Pthreads, MPI, OpenMP)
WorkShop cvperf Views
Summary
There is a large set of performance measurement tools:
• to collect the running status of the machine
• to collect the resource usage of individual programs
There is a large set of performance analysis tools:
• perfex, based on the processor event counters, valid for
evaluation of complete programs or parts of programs
• SpeedShop tools for command-line evaluation of the
performance data