Understanding Performance Counter Data

Authors: Alonso Bayona, Michael Maxwell, Manuel Nieto, Leonardo Salayandia, Seetharami Seelam
Mentor: Dr. Patricia J. Teller
What are performance counters?
Performance counters are a special set of registers, most commonly found on the processor’s chip, which count a number of
different events that occur inside the computer when an application is being executed.
Why are performance counters important?
The events counted provide data that can be used to assess the performance of an application executed on a particular processor. Performance analysis helps
programmers find bottlenecks in their programs, allowing them to build more efficient applications.
RECENT WORK
DoD applications programmers working on the IBM POWER3 computer architecture had questions pertaining to performance counter data for that specific architecture. The questions were forwarded to our team by collaborators at UTK (University of Tennessee, Knoxville), and we took on the task of answering any questions not answered by UTK.
LIST OF QUESTIONS
1. Exactly how are floating-point (FP) operations counted? (What is counted?) We have observed that FP loads & stores are not counted, and that FMAs (FP Multiply-Adds) are counted as one FP op. Also, FP round-to-single is counted as one FP op. Are divides and SQRTs (square roots) included? Are SQRTs counted as one FP op?
2. Kevin London said that PAPI_L1_DCH will return L1 data cache (DC) hits; however, that event is not available on the POWER3. We can derive L1 DC hits using "total references to the L1 DC" minus the number of L1 misses (PAPI_L1_DCM). How do we get total L1 references? Obviously, we should include the number of loads (PAPI_LD_CMPL) plus the number of stores (PAPI_ST_CMPL), but do we count prefetches (data fetched speculatively)?
3. Are prefetches already part of the load count? (They probably shouldn't be, since the result goes to cache, but not to a register.) Are prefetches part of the L1 miss count? Apparently there is a counter for prefetch hits (PM_DC_PREF_HIT). Should the hit rate be calculated as PAPI_L1_DCH / (PAPI_LD_INS + PAPI_ST_INS), or as (PAPI_L1_DCH + PM_DC_PREF_HIT) / (PAPI_LD_INS + PAPI_ST_INS + number of prefetches)? If the latter, how do you count the number of prefetches?
4. Same question as 3, except for the L2 cache. This question is also complicated by the fact that the L2 cache is (we think) unified (data and instructions). If this is true, how do instruction prefetches fit into the calculation?
5. What (on earth) is the difference between the events PM_LD_MISS_L1, PM_LD_MISS_EXCEED_L2, and PM_LS_MISS_EXCEED_NO_L2? Also, the latter two events take a "threshold" as an argument; how do you specify this to PAPI?
6. On the POWER3 SP, does the sum PM_FPU_FADD_FMUL + PM_FPU_FCMP + PM_FPU_FDIV + PM_FPU_FEST + PM_FPU_FMA + PM_FPU_FPSCR + PM_FPU_FRSP + PM_FPU_FSQRT equal PM_FPU0_CMPL + PM_FPU1_CMPL?
7. On the POWER3 SP, does PM_IC_HIT + PM_IC_MISS equal PM_INST_CMPL or PM_INST_DISP?
8. More generally, on a speculative processor there will be more instructions dispatched than completed, and at some point some instructions will be cancelled (is this correct?). Are instructions cancelled before or after they touch the cache? This is important in calculating the cache miss rate, since (hopefully) the miss rate is either (PM_LD_MISS_L1 + PM_ST_MISS_L1) / (PM_LD_CMPL + PM_ST_CMPL) or (PM_LD_MISS_L1 + PM_ST_MISS_L1) / (PM_LD_DISP + PM_ST_DISP). Which is it?
[Figure: POWER3 processor die]

What is being counted in floating-point operations?
• Simple math operations (+, -, *, /) each count as 1 FLOP (floating-point operation).
• A multiply followed by an add, handled by special hardware as a single FMA instruction, counts as only 1 FLOP.
• A square root operation (sqrt), when handled by a software routine, counts as 21 or 22 FLOPs.
• The POWER3 has special hardware to handle sqrt operations, but the compiler does not always use that hardware.
• Other operations counted: rounding operations and register moves.
• Operations not counted: floating-point data loads and stores.

[Diagram: FLOP-counting flowchart for FPU 0 and FPU 1 (the two floating-point units) — if the FLOP is a sqrt software routine, Counter = Counter + (21 or 22); otherwise, including FMA and hardware sqrt, Counter = Counter + 1]

Miss rate on L1 and L2 data caches
• L1 and L2 data cache miss rates are not easy to estimate because of the prefetching mechanism present in almost all modern processors.
• Prefetching reduces the miss rate for sequential data access.
• Indirect methods to measure the L1 and L2 data cache miss rates need to be researched.

How is the cache-miss rate calculated with performance counters on the POWER3?
• Available events:
  • Load miss @ L1
  • Load dispatched
  • Load completed
  • Store miss @ L1
  • Store dispatched
  • Store completed

[Diagram: pipeline with DISPATCHED and COMPLETED stages. Speculative execution: not all dispatched instructions may get completed.]

• The complement of the miss rate (the hit rate) can be computed as follows:
L1 hit rate = 100 * (1 - (L1 load misses + L1 store misses) / (total loads and stores))
L2 hit rate = 100 * (1 - (L2 load misses + L2 store misses) / (total L1 misses))
What tools are used in our research?
PAPI: An API (Application Program Interface) used to access performance counters. It allows the user to dynamically select and display performance counter data, i.e., count specific events.
Micro-benchmarks: Small programs designed to exercise specific processor resources which, in turn, generate events recorded by the performance counters. Due to their simplicity, we can, with sufficient knowledge of the target processor architecture, predict event counts, which are used to understand the behavior of the performance counters and the underlying architecture.
These metrics were obtained at: http://www.sdsc.edu/SciApps/IBM_tools/ibm_toolkit/HPM_2_4_3.html
L1 D-cache miss rate — two candidate formulas:

Cache-miss rate overestimated:
(LOAD MISS @ L1 + STORE MISS @ L1) / (LOAD CMPL + STORE CMPL)

or

Cache-miss rate underestimated (should be a better approximation; recommended by PCAT):
(LOAD MISS @ L1 + STORE MISS @ L1) / (LOAD DISP + STORE DISP)
[Chart: % error of event counts relative to the expected value on several machines (Itanium, Pentium, R12k, POWER3)]

L1 Instruction Cache Hits
• An in-depth understanding of the architecture being studied is often required to correctly analyze performance data.
• Assuming that the commonly used definition of an event holds for any platform may lead to misinterpretation of the performance data.
• For instance, on the POWER3 the instruction cache hit event is triggered when a block of instructions (up to 8) is fetched from the cache into the instruction buffer, rather than on every single fetch of an instruction.
• Through experimentation, while trying to answer question 7, we found that on the POWER3 the following relation holds for a sequential program:

((PM_IC_HIT - IC_PREF_USED) + PM_IC_MISS) * 8 ≈ PM_INST_CMPL
Which floating-point operations contribute to the total floating-point operations completed?
Assembler micro-benchmark example

.
.                  # Set up input parameters
.
lfd  fp1,64(SP)    # load a value into register fp1
lfd  fp2,72(SP)    # load a value into register fp2
fa   fp3,fp1,fp2   # perform a floating-point add on the two values
stfd fp3,72(SP)    # store the result of the floating-point operation
.
.                  # More operations
.
[Chart: error difference with respect to instructions completed and dispatched — Error (%) vs. instructions per benchmark (32 to 32768); series: error w.r.t. instructions completed, error w.r.t. instructions dispatched]
• The micro-benchmarks had to be written in assembler because of the difficulty of triggering specific events from a high-level language such as C.
• A different micro-benchmark was written for each of the operations tested.
• The micro-benchmarks revealed that the equation previously thought to give the total number of floating-point operations does not hold.
• PM_FPU0_FMOV_FEST must be added to the equation. (The fres instruction, which gives an estimate of the reciprocal of a floating-point operand, is counted by the PM_FPU0_FMOV_FEST event as well as by the PM_FPU_FEST event. Thus, when any kind of estimate instruction is used, the proposed equation counts fres instructions twice.)
• Division and square root floating-point operations are counted as FMA operations. (STILL UNDER INVESTIGATION)
SPONSORS
Department of Defense (DOD), MIE (Model Institutions for Excellence) REU (Research Experiences for Undergraduates) Program, and The Dodson Endowment