Program Development Tools: SGI Performance Tools

Timing Mechanisms
• External timers provided by IRIX
  – timex
  – time
  – ssusage
• Internal timers

The Event Counters
• The MIPS R1x000 CPU has on-chip event counters that are used to monitor processor activity
  – there are two counters in the R10000 processor
  – each counter has a selector for the events to be counted: counter 0 selects among events 0-15, counter 1 among events 16-31
  – see man r10k_counters(5) for a complete description
• The R1x000 counters are available through:
  – the IRIX perfex command
  – the profiling environment
  – a procedural interface (Fortran and C)

The perfex Command
  perfex [-e e0] [-e e1] [-a] [-mp] [-y] [-s] [-x] cmd [args]
where
  – e0 and e1 are the event identifiers; they should belong to non-overlapping groups (i.e. one from 0-15 and one from 16-31)
  – -a   multiplex all events
  – -mp  report counts for each thread, as well as totals
  – -y   give timing estimates for each count and general statistics
  – -s   start/stop counting on the USR1/USR2 signals
  – -x   count events on exception as well

examples:
  perfex -e 2 -e 25 ./swp
  .................... output from swp execution
  Issued loads................................. 13847619
  Primary data cache misses....................  2310349

  perfex -e 3 -e 26 ./swp
  .................... output from swp execution
  Issued stores................................  9176152
  Secondary data cache misses..................     9029

Interpretation of event counters requires a model!

perfex -t: the Model Cost Table
Costs for IP27 processor, MIPS R12000 CPU (all costs in clks)

  Event Counter Name                                      Typical  Minimum  Maximum
  =================================================================================
   0 Cycles                                                  1.00     1.00     1.00
   1 Decoded instructions                                    0.00     0.00     1.00
   2 Decoded loads                                           1.00     1.00     1.00
   3 Decoded stores                                          1.00     1.00     1.00
   4 Miss handling table occupancy                           0.00     0.00     0.00
   5 Failed store conditionals                               1.00     1.00     1.00
   6 Resolved conditional branches                           1.00     1.00     1.00
   7 Quadwords written back from scache                      8.49     5.90     8.77
   8 Correctable scache data array ECC errors                0.00     0.00     1.00
   9 Primary instruction cache misses                       17.01     4.34    17.01
  10 Secondary instruction cache misses                     99.89    63.03    99.89
  11 Instr misprediction from scache way prediction          0.00     0.00     1.00
  12 External interventions                                  0.00     0.00     0.00
  13 External invalidations                                  0.00     0.00     0.00
  14 ALU/FPU progress cycles                                 1.00     1.00     1.00
  15 Graduated instructions                                  0.00     0.00     1.00
  16 Executed prefetch instructions                          0.00     0.00     0.00
  17 Prefetch primary data cache misses                      0.00     0.00     1.00
  18 Graduated loads                                         1.00     1.00     1.00
  19 Graduated stores                                        1.00     1.00     1.00
  20 Graduated store conditionals                            1.00     1.00     1.00
  21 Graduated floating point instructions                   1.00     0.50    52.00
  22 Quadwords written back from primary data cache          3.98     3.14     3.98
  23 TLB misses                                             77.78    77.78    77.78
  24 Mispredicted branches                                   7.28     6.00     8.81
  25 Primary data cache misses                               8.50     2.17     8.50
  26 Secondary data cache misses                            99.89    63.03    99.89
  27 Data misprediction from scache way prediction table     0.00     0.00     1.00
  28 State of intervention hits in scache                    0.00     0.00     0.00
  29 State of invalidation hits in scache                    0.00     0.00     0.00
  30 Store/prefetch exclusive to clean block in scache       1.00     1.00     1.00
  31 Store/prefetch exclusive to shared block in scache      1.00     1.00     1.00

perfex -a -y: Multiplexing Events
Based on 300 MHz IP27, MIPS R12000 CPU

  Event Counter Name                                   Counter Value   Typical (s)   Minimum (s)   Maximum (s)
  ===========================================================================================================
   0 Cycles                                            5019528970688  16731.763236  16731.763236  16731.763236
  16 Executed prefetch instructions                       8606259264      0.000000      0.000000      0.000000
   4 Miss handling table occupancy                     5443332412048  18144.441373  18144.441373  18144.441373
  18 Graduated loads                                    938555860624   3128.519535   3128.519535   3128.519535
   2 Decoded loads                                      935411271824   3118.037573   3118.037573   3118.037573
  25 Primary data cache misses                           83185778832   2356.930400    601.710467   2356.930400
  21 Graduated floating point instructions              519085469744   1730.284899    865.142450  89974.814756
   3 Decoded stores                                     413277338336   1377.591128   1377.591128   1377.591128
  19 Graduated stores                                   411214746144   1370.715820   1370.715820   1370.715820
  22 Quadwords written back from primary data cache      63439898464    841.635986    664.004271    841.635986
   6 Resolved conditional branches                      231879145424    772.930485    772.930485    772.930485
  26 Secondary data cache misses
                                                         1951064336    649.639388    409.918617    649.639388
   7 Quadwords written back from scache                  13451750864    380.684549    264.551100    393.239517
  23 TLB misses                                           1238242032    321.034884    321.034884    321.034884
   9 Primary instruction cache misses                     1719354640     97.487408     24.873330     97.487408
  24 Mispredicted branches                                3744555216     90.867873     74.891104    109.965105
  10 Secondary instruction cache misses                     26286992      8.752692      5.522897      8.752692
  31 Store/prefetch exclusive to shared block in scache   1435428608      4.784762      4.784762      4.784762
  20 Graduated store conditionals                          639716576      2.132389      2.132389      2.132389
   5 Failed store conditionals                              54079168      0.180264      0.180264      0.180264
  30 Store/prefetch exclusive to clean block in scache      20579264      0.068598      0.068598      0.068598
   1 Decoded instructions                              3312748123920      0.000000      0.000000  11042.493746
   8 Correctable scache data array ECC errors                       0      0.000000      0.000000      0.000000
  11 Instruction mispred from scache way pred. table       166448576      0.000000      0.000000      0.554829
  12 External interventions                               1584014720      0.000000      0.000000      0.000000
  13 External invalidations                               3304023728      0.000000      0.000000      0.000000
  14 ALU/FPU progress cycles                                       0      0.000000      0.000000      0.000000
  15 Graduated instructions                            3268746182224      0.000000      0.000000  10895.820607
  ...

perfex -y Statistics (R12000)

  Statistics
  =======================================================================
  Graduated instructions/cycle............................     0.651206
  Graduated floating point instructions/cycle.............     0.103413
  Graduated loads & stores/cycle..........................     0.268904
  Graduated loads & stores/floating point instruction.....     2.600286
  Mispredicted branches/Resolved conditional branches.....     0.016149
  Graduated loads/Decoded loads (and prefetches)..........     0.994214
  Graduated stores/Decoded stores.........................     0.995009
  Data mispredict/Data scache hits........................     0.036721
  Instruction mispredict/Instruction scache hits..........     0.098312
  L1 Cache Line Reuse.....................................    15.225978
  L2 Cache Line Reuse.....................................    41.636102
  L1 Data Cache Hit Rate..................................     0.938370
  L2 Data Cache Hit Rate..................................     0.976546
  Time accessing memory/Total time........................     0.467783
  L1--L2 bandwidth used (MB/s, average per process).......   219.760658
  Memory bandwidth used (MB/s, average per process).......    27.789316
  MFLOPS (average per process)............................    31.023955
  Cache misses in flight per cycle (average)..............     1.084431
  Prefetch cache miss rate................................     0.606133

Explanation of Statistics/1
• L1 cache line reuse: should be large (>>1)
  – the average number of times a primary data cache line is used after it has been moved into the cache
    L1 cache line reuse = (loads + stores - L1 cache misses) / L1 cache misses
• L2 cache line reuse: should be large (>>1)
  – the average number of times a secondary cache line is used after it has been moved into the cache
    L2 cache line reuse = (L1 cache misses - L2 cache misses) / L2 cache misses
• L1 data cache hit rate: should be ~1
  – the fraction of data accesses that find the data already resident in L1
    L1 data cache hit rate = 1 - L1 cache misses / (loads + stores)
• L2 data cache hit rate: should be ~1
  – the fraction of data accesses that find the data already resident in L2
    L2 data cache hit rate = 1 - L2 cache misses / L1 cache misses

Explanation of Statistics/2
• Instruction or data mispredictions/hits: should be small (~0)
• L1-L2 bandwidth used ([MB/s], average per process)
  – the total number of bytes moved between L1 and L2, divided by the total run time
    #bytes moved = L1 cache misses * L1 cache line size + quadwords written back * 16 bytes
• Memory bandwidth used ([MB/s], average per process)
  – the total number of bytes moved between L2 and memory, divided by the total run time
    #bytes moved = L2 cache misses * L2 cache line size + quadwords written back * 16 bytes
• Performance in Mflop/s: larger is better
  – computed from the number of fp instructions divided by the total run time
  – the madd instruction counts as 1, so this statistic usually greatly underestimates the performance in Mflop/s
• Time accessing memory/Total time: should be small (~0)
    T(loads + stores + L1 misses + L2 misses + TLB misses) / total run time

Explanation of Statistics/3
The R12000 processor gives additional information:
• Miss handling table occupancy
  – the average number of entries in the MHT; a higher value indicates more outstanding cache misses. An interesting quantity is
    Mh = MHT occupancy / cycles   (range 0..5)
  – Mh should be >1: if it is close to 1, there may be dependencies in the data that prevent the processor from issuing multiple outstanding cache misses
  – it can be <1 for code that fits into the cache
  – a high Mh combined with low memory bandwidth indicates many L1 misses to the same L2 cache line
• Prefetch miss rate
  – the rate at which prefetch instructions miss in the first-level cache
    Rp = L1 prefetch misses (17) / #prefetch instructions (16)   (range 0..1)
  – Rp should be close to 1: to be effective, every prefetch should miss in the cache

R1x000 Important Events

  R10000 event  Description                                     R12000 event
       0        Cycles
                Miss handling table occupancy                        4
       7        Quadwords written back from L2
       9        L1 instruction cache misses
      10        L2 instruction cache misses
      12        External interventions
      13        External invalidations
      14        ALU/FPU forward progress cycles
      15        Graduated instructions
                Prefetch instructions                               16
      18        Graduated loads
      19        Graduated stores
      21        Graduated fp instructions
      22        Quadwords written back from L1
      23        TLB misses
      25        L1 data cache misses
      26        L2 data cache misses
      28        External intervention hits in L2
      29        External invalidation hits in L2
      30        Store/prefetch exclusive to clean block in L2
      31        Store/prefetch exclusive to shared block in L2

Cache Coherency Counters
• The directory cache coherency protocol sends signals to nodes that requested data from memory in the past
• Invalidations are transactions that convert cache lines from the "shared" to the "invalid" state
• Interventions are transactions that convert cache lines from the "exclusive" to the "shared" or "invalid" state (depending on whether the requestor asked for "shared" or "exclusive" access)
• Counters 12 and 13 tell how many interventions and invalidations have been received by the processor; counters 28 and 29 tell how many of them actually hit in the cache
• Invalidations and/or interventions are sent to a node because the cache line owner thinks that node has the cache line.
However, if that processor has dropped the cache line earlier, the signal will not find the data in the cache: counters 12 or 13 will be updated, while 28 or 29 will not.

Explanation: Useful Counters
• Cycles are proportional to the time spent on the CPU
  – spin times are added, but wait times are not
• Graduated instructions divided by cycles gives the IPC measure for the program. The R1x000 processor can process up to 4 IPC; a value of IPC >1.5 should be considered good.
• ALU/FPU progress cycles measure the actual computation on the processor
• TLB misses counts how many times a virtual address translation was not found in the Translation Lookaside Buffer. If this is a large contribution, larger pages might be used in the application.

Event Counters Procedural Interface

C synopsis:
  int start_counters(int e0, int e1);
  int read_counters(int e0, long long *c0, int e1, long long *c1);
  int print_counters(int e0, long long c0, int e1, long long c1);

Fortran synopsis:
  INTEGER*8 C0, C1
  INTEGER*4 E0, E1
  INTEGER*4 FUNCTION START_COUNTERS(E0, E1)
  INTEGER*4 FUNCTION READ_COUNTERS(E0, C0, E1, C1)
  INTEGER*4 FUNCTION PRINT_COUNTERS(E0, C0, E1, C1)

• The event specifiers e0 and e1 can also be set (or overruled) at run time by the environment variables T5_EVENT0 and T5_EVENT1
• link with -lperfex
• the counter values c0 and c1 are 64-bit integers

perfex Measurements of matmul
The C matrix multiply program has been compiled with
  -mips4 -O3 -OPT:alias=restrict -mp
To measure MEM-resident performance, use large pages (enabled when -mp is used):
  setenv PAGESIZE_STACK 1024K
  setenv PAGESIZE_DATA  1024K

  Statistics                                        L1 (N=20)  L2 (N=500)  MEM (N=2000)
  ====================================================================================
  Graduated instructions/cycle.....................      1.85        1.82        1.71
  Graduated floating point instructions/cycle......      0.80        0.94        0.88
  Graduated loads & stores/cycle...................      0.64        0.54        0.50
  Graduated loads & stores/floating point instr....      0.81        0.57        0.57
  L1 Cache Line Reuse..............................  205758.87        4.91        4.1
  L2 Cache Line Reuse..............................     64.92    13343.20      202.52
  L1 Data Cache Hit Rate...........................      0.99        0.83        0.80
  L2 Data Cache Hit Rate...........................      0.98        0.99        0.99
  Time accessing memory/Total time.................      0.64        1.32        1.4
  L1--L2 bandwidth used (MB/s, average per proc)...      0.05      907.08      977.7
  Memory bandwidth used (MB/s, average per proc)...      0.05        0.32       30.4
  MFLOPS (average per process).....................    239.15      281.70      266.0
  Cache misses in flight per cycle (average).......      0.00        1.05        1.34

The SpeedShop Profiler
• No recompilation or relinking is necessary
• Multiple performance experiments and views
• The following performance experiments are supported:
  – Program Counter (PC) sampling            -pcsamp
    • gives exclusive time per subroutine and per source line
    • no call stack information
  – User time                                -usertime
    • gives exclusive & inclusive times per subroutine (call stack)
    • no source line information
  – Hardware counter profiling               -"event"_hwc
  – Communication overhead of MPI            -mpi
  – Floating point exception tracing         -fpe
  – Tracing of I/O system calls              -io
  – malloc/free tracing                      -heap
  – Basic-block counting experiment          -ideal
  see man speedshop for an explanation of the experiments
• usage:
  – ssrun -exp [options] program [args]
  – prof "ssrun output file"

SpeedShop Experiments
[Diagram: histogram over the program address space, with routines main, sub1, sub2, sub3, ..., subn between addresses 0x000 and 0xf30]
• Time interrupts: every 10 ms for normal experiments (e.g. -pcsamp), every 1 ms for "fast" experiments (e.g. -fpcsamp)
• PC sampling experiment: records the current instruction address (Program Counter) in a histogram; overhead <1%; 16-bit-wide bins (e.g. -pcsamp) or 32-bit-wide bins (e.g. -pcsampx)
• Usertime sampling experiment: records all addresses on the call stack; overhead <5%
• Event interrupts: usage counts with event-dependent thresholds; "fast" experiments exist

ssrun: Experiment Recording
ssrun writes its performance measurements into a binary file which is named after the program:
  ssrun -expname options progname prog-args
The output is one or more binary files named
  progname.expname.ID#pid
where ID is a letter code as follows:
  m  the master process created by ssrun
  p  slave processes in parallel execution (created by a call to sproc())
  f  slave processes created by a call to fork()
  e  a process created by a call to exec()
  s  a system call process
examples:
  > mpirun -np 2 ssrun -mpi cgmd
  -rw-r--r-- 1 zacharov user  237672 Sep 10 13:50 cgmd.mpi.f318839
  -rw-r--r-- 1 zacharov user 4733496 Sep 10 13:50 cgmd.mpi.f318854
  -rw-r--r-- 1 zacharov user   11896 Sep 10 13:50 cgmd.mpi.m318787
  > ssrun -usertime cgmd
  -rw-r--r-- 1 zacharov user 1613562 Sep 10 14:28 cgmd.usertime.m320140
  -rw-r--r-- 1 zacharov user 1601776 Sep 10 14:28 cgmd.usertime.p320140

prof: Performance Interpretation
• prof is the command-line interpreter for ssrun experiments
• Feedback file generation for compiler and linker
• Use cvperf(1) (not prof) for heap and MPI trace analysis
prof options:
• -heavy    report the performance per source code line, ordered by descending performance measure
• -lines    report the performance per source code line, in program sequence
• -usage    summary of the system resources used by the program
• -gprof    show inclusive timing of callers and callees of each function
• -q n      truncate after the first n routines or lines
• -q n%     truncate after the routine that takes >n% of total time
• -q ncum%  truncate after n% of cumulative time
warning: the executable should not contain stripped parts

PC Sampling Experiment/1
  % ssrun -fpcsampx ./program.exe
  % prof -lines program.exe.fpcsampx.m1329
  -------------------------------------------------------------------------
  SpeedShop profile listing generated Tue Feb 10 17:53:44 1998
    prof -lines program.exe.fpcsampx.m1329
      program.exe (n32): Target program
               fpcsampx: Experiment name
         pc,4,1000,0:cu: Marching orders
        R10000 / R10010: CPU / FPU
                      4: Number of CPUs
                    195: Clock frequency (MHz.)
  Experiment notes-
    From file program.exe.fpcsampx.m1329:
      Caliper point 0 at target begin, PID 1329
        /usr/users/Examples/speedshop/./program.exe
      Caliper point 1 at exit(0)
  -------------------------------------------------------------------------
  Summary of statistical PC sampling data (fpcsampx)-
     26758: Total samples
    26.758: Accumulated time (secs.)
       1.0: Time per sample (msecs.)
         4: Sample bin width (bytes)
  -------------------------------------------------------------------------
  Function list, in descending order by time
  -------------------------------------------------------------------------
  [index]   secs      %    cum.%  samples  function (dso: file, line)
  [1]      9.180   34.3%   34.3%     9180  sub1 (prog.exe: program.f, 39)
  [2]      7.070   26.4%   60.7%     7070  update_sum (prog.exe: program.f, 62)
  [3]      6.182   23.1%   83.8%     6182  sub3 (prog.exe: program.f, 112)
  [4]      1.968    7.4%   91.2%     1968  irand_ (libftn.so: rand_.c, 62)
  [5]      1.317    4.9%   96.1%     1317  rand_ (libftn.so: rand_.c, 67)
  [6]      1.040    3.9%  100.0%     1040  ssdemo (prog.exe: program.f, 1)
           0.001    0.0%  100.0%        1  **OTHER** (excluded DSOs, rld, etc.)
          26.758  100.0%  100.0%    26758  TOTAL

PC Sampling Experiment/2
  -------------------------------------------------------------------------
  Line list, in descending order by function-time and then line number
  -------------------------------------------------------------------------
   secs      %    cum.%  samples  function (dso: file, line)
  0.002    0.0%    0.0%        2  sub1 (program.exe: program.f, 49)
  0.007    0.0%    0.0%        7  sub1 (program.exe: program.f, 50)
  0.930    3.5%    3.5%      930  sub1 (program.exe: program.f, 54)
  8.241   30.8%   34.3%     8241  sub1 (program.exe: program.f, 55)
  0.858    3.2%   37.5%      858  update_sum (program.exe: program.f, 72)
  1.263    4.7%   42.2%     1263  update_sum (program.exe: program.f, 73)
  4.949   18.5%   60.7%     4949  update_sum (program.exe: program.f, 74)
  0.001    0.0%   60.7%        1  sub3 (program.exe: program.f, 122)
  1.029    3.8%   64.6%     1029  sub3 (program.exe: program.f, 123)
  5.152   19.3%   83.8%     5152  sub3 (program.exe: program.f, 124)
  0.785    2.9%   86.8%      785  irand_ (libftn.so: rand_.c, 62)
  1.183    4.4%   91.2%     1183  irand_ (libftn.so: rand_.c, 63)
  0.076    0.3%   91.5%       76  rand_ (libftn.so: rand_.c, 67)
  1.241    4.6%   96.1%     1241  rand_ (libftn.so: rand_.c, 69)
  0.174    0.7%   96.8%      174  ssdemo (program.exe: program.f, 20)
  0.866    3.2%  100.0%      866  ssdemo (program.exe: program.f, 21)

ssrun: Performance Experiments (hardware counter)
The R1x000 event counters can be used for performance analysis.
All experiments are described in man ssrun(1). Useful experiments:
• -[f]dsc_hwc
  – counts overflows of the secondary data cache miss counter (counter 26)
  – the address is recorded when the counter reaches 131 counts (29 for the "fast" experiment)
• -[f]gfp_hwc
  – counts overflows of the graduated floating point instruction counter (21)
  – the address is recorded when the counter reaches 32771 counts (6553 for the "fast" experiment)
• -prof_hwctime
  – profiles the counter specified by the environment variable _SPEEDSHOP_HWC_COUNTER_PROF_NUMBER using call-stack sampling
  – addresses are recorded on overflow of the value given by the environment variable _SPEEDSHOP_HWC_COUNTER_OVERFLOW
• -prof_hwc
  – same as -prof_hwctime, but with PC sampling on event counter overflow

Hardware Counters Experiment/1
  % ssrun -ftlb_hwc program.exe ; prof program.exe.ftlb_hwc.m1335
  -------------------------------------------------------------------------
  SpeedShop profile listing generated Tue Feb 10 17:54:23 1998
    prof -lines program.exe.ftlb_hwc.m1335
      program.exe (n32): Target program
               ftlb_hwc: Experiment name
           hwc,23,53:cu: Marching orders
        R10000 / R10010: CPU / FPU
                      4: Number of CPUs
                    195: Clock frequency (MHz.)
  Experiment notes-
    From file program.exe.ftlb_hwc.m1335:
      Caliper point 0 at target begin, PID 1335
        /usr/users/Examples/speedshop/./program.exe
      Caliper point 1 at exit(0)
  -------------------------------------------------------------------------
  Summary of R10K perf. counter overflow PC sampling data (ftlb_hwc)-
              412712: Total samples
     TLB misses (23): Counter name (number)
                  53: Counter overflow value
            21873736: Total counts
  -------------------------------------------------------------------------
  Function list, in descending order by counts
  -------------------------------------------------------------------------
  [index]     counts      %    cum.%  samples  function (dso: file, line)
  [1]       10224919  46.7%   46.7%   192923  sub1 (program.exe: program.f, 39)
  [2]        5252035  24.0%   70.8%    99095  sub3 (program.exe: program.f, 112)
  [3]        5111691  23.4%   94.1%    96447  update_sum (program.exe: program.f, 62)
  [4]        1043729   4.8%   98.9%    19693  ssdemo (program.exe: program.f, 1)
  [5]         241097   1.1%  100.0%     4549  rand_ (libftn.so: rand_.c, 67)
  [6]             53   0.0%  100.0%        1  __checktraps (libc.so.1: stubfpeexit.c, 3)
  [7]             53   0.0%  100.0%        1  exceed_length (libftn.so: wrtfmt.c, 1421)
  [8]             53   0.0%  100.0%        1  wrt_I (libftn.so: wrtfmt.c, 102)
  [9]             53   0.0%  100.0%        1  do_Lio_com (libftn.so: lio.c, 803)
                  53   0.0%  100.0%        1  **OTHER** (includes excluded DSOs, rld, etc.)
            21873736 100.0%  100.0%   412712  TOTAL

Hardware Counters Experiment/2
Use the environment variables (see man speedshop):
  % setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 6
  % setenv _SPEEDSHOP_HWC_COUNTER_OVERFLOW 100
  % ssrun -prof_hwc ./program.exe
  % prof -lines program.exe.prof_hwc.m1336
  -------------------------------------------------------------------------
  SpeedShop profile listing generated Tue Feb 10 17:54:56 1998
    prof -lines program.exe.prof_hwc.m1336
      program.exe (n32): Target program
               prof_hwc: Experiment name
                 hwc:cu: Marching orders
        R10000 / R10010: CPU / FPU
                      4: Number of CPUs
                    195: Clock frequency (MHz.)
  Experiment notes-
    From file program.exe.prof_hwc.m1336:
      Caliper point 0 at target begin, PID 1336
        /usr/users/Examples/speedshop/./program.exe
      Caliper point 1 at exit(0)
  -------------------------------------------------------------------------
  Summary of R10K perf. counter overflow PC sampling data (prof_hwc)-
                 2153328: Total samples
    Decoded branches (6): Counter name (number)
                     100: Counter overflow value
               215332800: Total counts
  etc...

User Time Experiment (-usertime experiment, prof -gprof)
  SpeedShop profile listing generated Tue Feb 10 17:51:59 1998
    prof -gprof program.exe.usertime.m1332
      program.exe (n32): Target program
               usertime: Experiment name
                  ut:cu: Marching orders
        R10000 / R10010: CPU / FPU
                      4: Number of CPUs
                    195: Clock frequency (MHz.)
  Experiment notes-
    From file program.exe.usertime.m1332:
      Caliper point 0 at target begin, PID 1332
        /usr/users/ruud/Examples/speedshop/./program.exe
      Caliper point 1 at exit(0)
  -------------------------------------------------------------------------
  Summary of statistical callstack sampling data (usertime)-
      301: Total samples
        0: Samples with incomplete traceback
    9.030: Accumulated time (secs.)
     30.0: Sample interval (msecs.)
  -------------------------------------------------------------------------
  Function list, in descending order by exclusive time
  -------------------------------------------------------------------------
  [index] excl.secs excl.%  cum.% incl.secs incl.% samples procedure (dso: file, line)
  [4]       2.850   31.6%   31.6%    5.370   59.5%    179  sub1 (program.exe: program.f, 39)
  [6]       2.520   27.9%   59.5%    2.520   27.9%     84  update_sum (program.exe: program.f, 62)
  [5]       1.980   21.9%   81.4%    3.000   33.2%    100  sub3 (program.exe: program.f, 112)
  [8]       0.660    7.3%   88.7%    0.660    7.3%     22  irand_ (libftn.so: rand_.c, 62)
  [7]       0.480    5.3%   94.0%    1.140   12.6%     38  rand_ (libftn.so: rand_.c, 67)
  [1]       0.390    4.3%   98.3%    9.030  100.0%    301  ssdemo (program.exe: program.f, 1)
  [10]      0.090    1.0%   99.3%    0.090    1.0%      3  _write (libc.so.1: write.s, 15)
  [16]      0.030    0.3%   99.7%    0.030    0.3%      1  s_wsle64 (libftn.so: lio.c, 132)
  [17]      0.030    0.3%  100.0%    0.030    0.3%      1  s_stop (libftn.so: s_stop.c, 54)
  [2]       0.000    0.0%  100.0%    9.030  100.0%    301  __start (program.exe: crt1text.s, 103)
  [3]       0.000    0.0%  100.0%    9.030  100.0%    301  main (libftn.so: main.c, 76)
  [9]       0.000    0.0%  100.0%    0.120    1.3%      4  sub2 (program.exe: program.f, 81)
  [11]      0.000    0.0%  100.0%    0.090    1.0%      3  e_wsle (libftn.so: lio.c, 205)
  -------------------------------------------------------------------------
  Butterfly function list, in descending order by inclusive time
  -------------------------------------------------------------------------
      attrib.%  attrib.time  incl.time  caller (callsite) [index]
  [index] incl.%  incl.time  self%  self-time  procedure [index]
      attrib.%  attrib.time  incl.time  callee (callsite) [index]
  -------------------------------------------------------------------------
      100.0%  9.030  9.030  main (@0x0b35c004; libftn.so: main.c, 97) [3]
  [1] 100.0%  9.030   4.3%  0.390  ssdemo [1]
       59.5%  5.370  5.370  sub1 (@0x10000ef0; prog.exe: program.f, 27) [4]
       33.2%  3.000  3.000  sub3 (@0x10000f30; program.exe: program.f, 33) [5]
        1.3%  0.120  1.140  rand_ (@0x10000e20; program.exe: program.f, 21) [7]
        1.3%  0.120  0.120  sub2 (@0x10000f14; program.exe: program.f, 31) [9]
        0.3%  0.030  0.030  s_stop (@0x10000f68; program.exe: program.f, 36) [17]
  -------------------------------------------------------------------------
  [2] 100.0%  9.030   0.0%  0.000  __start [2]
      100.0%  9.030  9.030  main (@0x10000d68; program.exe: crt1text.s, 166) [3]
  -------------------------------------------------------------------------
      100.0%  9.030  9.030  __start (@0x10000d68; program.exe: crt1text.s, 166) [2]
  [3] 100.0%  9.030   0.0%  0.000  main [3]
      100.0%  9.030  9.030  ssdemo (@0x0b35c004; libftn.so: main.c, 97) [1]
  -------------------------------------------------------------------------
       59.5%  5.370  9.030  ssdemo (@0x10000ef0; program.exe: program.f, 27) [1]
  [4]  59.5%  5.370  31.6%  2.850  sub1 [4]
       27.9%  2.520  2.520  update_sum (@0x10001024; program.exe: program.f, 51) [6]
  -------------------------------------------------------------------------
       33.2%  3.000  9.030  ssdemo (@0x10000f30; program.exe: program.f, 33) [1]
  [5]  33.2%  3.000  21.9%  1.980  sub3 [5]
       11.3%  1.020  1.140  rand_ (@0x10001470; program.exe: program.f, 124) [7]
  -------------------------------------------------------------------------
       27.9%  2.520  5.370  sub1 (@0x10001024; program.exe: program.f, 51) [4]
  [6]  27.9%  2.520  27.9%  2.520  update_sum [6]
  -------------------------------------------------------------------------
        1.3%  0.120  9.030  ssdemo (@0x10000e20; program.exe: program.f, 21) [1]
       11.3%  1.020  3.000  sub3 (@0x10001470; program.exe: program.f, 124) [5]
  [7]  12.6%  1.140   5.3%  0.480  rand_ [7]
        7.3%  0.660  0.660  irand_ (@0x0b340f6c; libftn.so: rand_.c, 69) [8]
  -------------------------------------------------------------------------
        7.3%  0.660  1.140  rand_ (@0x0b340f6c; libftn.so: rand_.c, 69) [7]
  [8]   7.3%  0.660   7.3%  0.660  irand_ [8]
  -------------------------------------------------------------------------

SpeedShop Basic Tools
• ssrun
  – basic performance data generation tool
  – outputs a binary file with the performance experiment results per program or execution thread
• prof
  – human-readable interpretation of the ssrun binary files
• sscord/ssorder and sswsextr
  – used to extract the working-set information from a program and to re-order the program subroutines so that frequently executed instructions are packed on the same memory pages (reduces TLB misses for instructions)
• ssusage: ssusage program prog-args
  – similar to time, but also gives resident size and I/O characteristics
• squeeze and thrash
  – change the amount of memory available to the user program
• ssdump
  – dumps the binary performance data files in ASCII format
• fbdump
  – writes compiler feedback files from prof
• ssaggregate
  – combines multiple SpeedShop binary performance data files into one

SpeedShop Performance Views (WorkShop cvperf views)
• Function list, program graph, butterfly
• Source code with performance annotation
• Disassembly and basic blocks
• Architectural information
• Usage data (resource consumption on the machine)
• Heap analysis and memory leaks
• Working sets
• MP overhead view (Pthreads, MPI, OpenMP)

Summary
There is a large set of performance measurement tools:
• to collect the running status of the machine
• to collect the resource usage of individual programs
There is a large set of performance analysis tools:
• perfex, based on the processor event counters, valid for evaluation of complete programs or parts of programs
• SpeedShop tools for command-line evaluation of the performance data