Understanding Performance Counter Data

Authors: Alonso Bayona, Michael Maxwell, Manuel Nieto, Leonardo Salayandia, Seetharami Seelam
Mentor: Dr. Patricia J. Teller

What are performance counters?
Performance counters are a special set of registers, most commonly found on the processor chip, that count a number of different events occurring inside the computer while an application executes.

Why are performance counters important?
The events counted provide data that can be used to assess the performance of an application executed on a particular processor. Performance analysis helps programmers find bottlenecks in their programs, allowing them to build more efficient applications.

RECENT WORK
DoD application programmers working on the IBM Power 3 computer architecture had questions about performance counter data for that architecture. The questions were forwarded to our team by collaborators at UTK (University of Tennessee, Knoxville), and we took on the task of answering any questions not answered by UTK.

LIST OF QUESTIONS
1. Exactly how are floating-point (FP) operations counted? (What is counted?) We have observed that FP loads and stores are not counted, and that FMAs (FP Multiply-Adds) are counted as one FP op. Also, FP round-to-single is counted as one FP op. Are divides and SQRTs (square roots) included? Are SQRTs counted as one FP op?
2. Kevin London said that PAPI_L1_DCH will return L1 data cache (DC) hits; however, that event is not available on the Power3. We can derive L1 DC hits using "total references to L1 DC" minus the number of L1 misses (PAPI_L1_DCM). How do we get total L1 references? Obviously, we should include the number of loads (PAPI_LD_CMPL) plus the number of stores (PAPI_ST_CMPL), but do we count prefetches (data fetched speculatively)?
3. Are prefetches already part of the load count? (Probably shouldn't be, since the result goes to cache but not to a register.) Are prefetches part of the L1 miss count? Apparently there is a counter for prefetch hits (PM_DC_PREF_HIT). Should the hit rate be calculated as PAPI_L1_DCH / (PAPI_LD_INS + PAPI_ST_INS) or as (PAPI_L1_DCH + PM_DC_PREF_HIT) / (PAPI_LD_INS + PAPI_ST_INS + number of prefetches)? If the latter, how do you count the number of prefetches?
4. Same question as 3, except for the L2 cache. This question is also complicated by the fact that the L2 cache is unified (data and instruction) (I think). If this is true, how do instruction prefetches fit into the calculation?
5. What (on earth) is the difference between the events PM_LD_MISS_L1, PM_LD_MISS_EXCEED_L2, and PM_LS_MISS_EXCEED_NO_L2? Also, the latter two events take a "threshold" as an argument; how do you specify this to PAPI?
6. On the POWER3 SP, does the sum PM_FPU_FADD_FMUL + PM_FPU_FCMP + PM_FPU_FDIV + PM_FPU_FEST + PM_FPU_FMA + PM_FPU_FPSCR + PM_FPU_FRSP + PM_FPU_FSQRT equal PM_FPU0_CMPL + PM_FPU1_CMPL?
7. On the POWER3 SP, does PM_IC_HIT + PM_IC_MISS equal PM_INST_CMPL or PM_INST_DISP?
8. More generally, on a speculative processor, there will be more instructions dispatched than completed, and at some point some instructions will be cancelled (is this correct?). Are instructions cancelled before or after they touch cache? This is important in calculating the cache miss rate, since (hopefully) the miss rate is either (PM_LD_MISS_L1 + PM_ST_MISS_L1) / (PM_LD_CMPL + PM_ST_CMPL) or (PM_LD_MISS_L1 + PM_ST_MISS_L1) / (PM_LD_DISP + PM_ST_DISP). Which is it?

[Figure: POWER 3 processor die. Unit label: Load/Store Machine. Adjacent panel title: Miss rate on L1 and L2 data caches]

What is being counted in floating-point operations?
• Simple math operations (+, -, *, /) each count as 1 FLOP (floating-point operation).
• A multiply followed by an add, called an FMA instruction, is handled by special hardware and counts as only 1 FLOP.
• A square root operation (sqrt), when handled by a software routine, counts as 21 or 22 FLOPs.
• The Power3 has special hardware to handle sqrt operations, but the compiler does not always use the hardware.
• Other operations counted: rounding operations and register moves.
• Operations not counted: floating-point data loads and stores.

[Flowchart: how a floating-point operation updates the counter. Recoverable labels: FPU 0 (Floating-point Unit 0); FPU 1 (Floating-point Unit 1); "If FLOP = SQRT routine: Counter = Counter + (21 or 22)"; "Yes"]

What tools are used in our research?
PAPI: An API (Application Program Interface) used to access performance counters. It allows the user to dynamically select and display performance counter data, i.e., to count specific events.
Micro-benchmarks: Small programs designed to use specific computer resources, which, in turn, generate events recorded by the performance counters. Because the programs are simple, we can, with sufficient knowledge of the target processor architecture, predict the event counts. Comparing predicted and observed counts helps us understand the behavior of the performance counters and the underlying architecture.

How to calculate cache-miss rate with performance counters on the POWER 3?
• L1 and L2 data cache miss rates are not easy to estimate because of the prefetching mechanism present in almost all modern processors.
• Available events:
  • Load miss @ L1
  • Load dispatched
  • Load completed
  • Store miss @ L1
  • Store dispatched
  • Store completed
• There is a need to research indirect methods to measure L1 and L2 data cache miss rates.

[Diagram: PIPELINE, with stages DISPATCHED and COMPLETED. Note: with speculative execution, not all dispatched instructions are completed]

• Prefetching reduces the miss rate for sequential data access.
• The complement of the miss rate can be computed as follows:

  L1 hit rate = 100 * (1 - (Load misses in L1 + Store misses in L1) / Total loads and stores)
  L2 hit rate = 100 * (1 - (Load misses in L2 + Store misses in L2) / Total L1 misses)
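The L1 and L2 hit-rate formulas given earlier, together with the two candidate miss-rate denominators raised in question 8, can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names and argument names are hypothetical, the sample counts are made up rather than measured on a POWER 3, and the sketch does not read any hardware counters or call PAPI.

```python
def hit_rate_percent(misses, references):
    """Return 100 * (1 - misses / references), the complement of the miss rate."""
    if references <= 0:
        raise ValueError("references must be positive")
    return 100.0 * (1.0 - misses / references)


def l1_hit_rate(load_miss_l1, store_miss_l1, total_loads_and_stores):
    """L1 hit rate (%) from L1 load/store misses and total loads and stores."""
    return hit_rate_percent(load_miss_l1 + store_miss_l1, total_loads_and_stores)


def l2_hit_rate(load_miss_l2, store_miss_l2, total_l1_misses):
    """L2 hit rate (%) from L2 load/store misses and total L1 misses."""
    return hit_rate_percent(load_miss_l2 + store_miss_l2, total_l1_misses)


def l1_miss_rates(load_miss_l1, store_miss_l1, completed, dispatched):
    """Return the L1 miss rate (%) under both candidate denominators:
    (loads + stores completed) and (loads + stores dispatched)."""
    if completed <= 0 or dispatched <= 0:
        raise ValueError("denominators must be positive")
    misses = load_miss_l1 + store_miss_l1
    return 100.0 * misses / completed, 100.0 * misses / dispatched


if __name__ == "__main__":
    # Made-up event counts, for illustration only.
    print("L1 hit rate (%):", l1_hit_rate(150, 100, 1000))
    print("L2 hit rate (%):", l2_hit_rate(30, 20, 250))
    by_completed, by_dispatched = l1_miss_rates(60, 40, 800, 1000)
    print("L1 miss rate w.r.t. completed (%):", by_completed)
    print("L1 miss rate w.r.t. dispatched (%):", by_dispatched)
```

Because at least as many instructions are dispatched as are completed, the dispatched-based miss rate is never larger than the completed-based one for the same miss count; which of the two is closer to the true rate depends on whether cancelled instructions touch the cache.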
The L1 and L2 hit-rate metrics were obtained from: http://www.sdsc.edu/SciApps/IBM_tools/ibm_toolkit/HPM_2_4_3.html

L1 D cache misses
The miss rate can be computed with either of two denominators:

  (LOAD MISS @ L1 + STORE MISS @ L1) / (LOAD CMPL + STORE CMPL)
or
  (LOAD MISS @ L1 + STORE MISS @ L1) / (LOAD DISP + STORE DISP)

[Figure labels: Cache-Miss Rate Underestimated; Cache-Miss Rate Overestimated]

[Flowchart labels (floating-point counting): No; FPU; Counter = Counter + 1; FMA Hardware]

• An in-depth understanding of the architecture being studied is often required to analyze performance data correctly.
• Assuming that the commonly used definition of an event holds on every platform may lead to misinterpretation of the performance data.
• For instance, on the Power 3, the instruction cache hit event is triggered when a block of instructions (up to 8) is fetched from the cache into the instruction buffer, rather than on every fetch of a single instruction.
• By experimentation, and in trying to answer question 7, we found that on the Power 3 a relation holds for a sequential program between the quantity ((PM_IC_HIT - IC_PREF_USED) + PM_IC_MISS) * 8 and PM_INST_CMPL.

[Figure: % Error plot. Legend entries include Itanium and Expected value. Axis ticks omitted]

[Figure: Error Difference with respect to Instructions Completed and Dispatched. Series: Error w.r.t. instructions completed; Error w.r.t. instructions dispatched. Y axis: Error (%)]

Which floating-point operations contribute to the total floating-point operations completed?

Assembler micro-benchmark example

    .                    # Set up input parameters
    .
    .
    lfd  fp1,64(SP)      # load a floating-point value from memory into register fp1
    lfd  fp2,72(SP)      # load a floating-point value from memory into register fp2
    fa   fp3,fp1,fp2     # perform a floating-point add: fp3 = fp1 + fp2
    stfd fp3,72(SP)      # store the result from register fp3 back to memory
    .
    .
    .                    # More operations

• The micro-benchmarks had to be written in assembler because it is difficult to trigger specific events from a high-level language such as C.
• A different micro-benchmark was written for each of the operations tested.
• The micro-benchmarks revealed that the equation previously thought to give the total number of floating-point operations does not hold.
• PM_FPU0_FMOV_FEST must be added to the equation. (The fres instruction, which gives an estimate of the reciprocal of a floating-point operand, is counted by the PM_FPU0_FMOV_FEST event as well as by the PM_FPU_FEST event. Thus, whenever an estimate instruction is used, the proposed equation counts fres instructions twice.)
• Division and square root floating-point operations are counted as FMA operations. (STILL UNDER INVESTIGATION)

[Figure: L1 Instruction Cache Hits. Legend entries include Pentium, R12k, and Power3. Axis ticks omitted]

[Flowchart labels (floating-point counting): SQRT Hardware; (Floating-point Unit)]

[Figure annotation: "Should be a better approximation (recommended by PCAT)". X axis of the error-difference plot: Instructions per Benchmark]

SPONSORS
Department of Defense (DOD), MIE (Model Institutions for Excellence) REU (Research Experiences for Undergraduates) Program, and The Dodson Endowment