Measurement tools and techniques
- Fundamental strategies
- Interval timers
- Program profiling
- Tracing
- Indirect measurement

Copyright 2004 David J. Lilja

Events
- Most measurement tools are based on events
- Event: some predefined change to system state
- Definition depends on the metric being measured
  - Memory reference
  - Disk access
  - Change in a register's state
  - Network message
  - Processor interrupt

Event Classification
- Count metrics
  - The number of times event X occurs
  - Number of cache misses
  - Number of I/O operations

Event Classification
- Secondary-event metrics
  - Record a value when triggered by some event
  - Record block size for each I/O operation
  - Count number of operations
  - Find average I/O transfer size

Event Classification
- Profiles
  - Characterization of overall behavior
  - Aggregate/big-picture view of an application program
  - Time spent in each function

Event-Driven Strategies
- Record necessary information only when selected event occurs
- Modify system to record event
- Dump data when program terminates
  - May need intermediate dumps also
- E.g. simple counter in page fault routine

Event-Driven Strategies
- System overhead
  - Only when the event of interest actually occurs
  - Infrequent events → little perturbation
  - Frequent events → high perturbation
    - No longer "typical" behavior?
- Perturbation changes system being measured

Event-Driven Strategies
- Inter-event time is unpredictable
  - Depends on when events actually occur
  - Makes it hard to estimate perturbation
  - How long to measure?
- Event-driven measurement tools → good for low-frequency events

Event-Driven Strategies
[Figure: timeline with eight events, each marked +1. Counts 8 events exactly.]

Tracing
- Similar to event-driven
  - But record additional system state
- Overhead
  - Event has occurred: count
  - Additional information to uniquely identify event, e.g.
    addresses that cause page faults
  - Additional memory or disk storage
  - Time to save state
- Relatively large system perturbation

Tracing
[Figure: timeline with eight events, each marked "+1; Addr". Counts 8 events plus extra data.]

Sampling
- Record necessary state at fixed time intervals
- Overhead
  - Independent of specific event frequency
  - Depends on sampling frequency
- Misses some events
- Produces statistical summary
  - May miss infrequent events
  - Each replication will produce different results

Sampling
[Figure: sampling timeline; three samples are marked +1. Counts 3 events out of 5 samples.]

Comparisons
                  Event count    Tracing          Sampling
  Resolution      Exact count    Detailed info    Statistical summary
  Overhead        Low            High             Constant
  Perturbation    ~ #events      High             Fixed

Comparison
- Event counting
  - Best for low-frequency events
  - Required if exact counts needed
- Sampling
  - Best for high-frequency events
  - If statistical summary is adequate
- Tracing
  - When additional detail is required

Indirect Measurements
- Used when desired metric is not directly accessible
- Measure one thing directly
- Derive or deduce desired metric
- Highly dependent on creativity of performance analyst

Interval Timers
- Fundamental tool of performance measurement
- Measure execution time of any portion of a program
- Provide time basis for sampling

Interval Timers
- Actually count clock pulses between two events
  - Event 1: x1 = counter value
  - Event 2: x2 = counter value
  - Clock period = Tc
  - Te = (x2 - x1) Tc

Using an Interval Timer
- Within an application program:

    start_count = read_timer();
    /* portion of program to be measured */
    stop_count = read_timer();
    elapsed_time = (stop_count - start_count) * clock_period;

Hardware Timer
[Figure: a clock with period Tc drives an n-bit counter whose value goes to a CPU input port]
- Te = (x2 - x1) Tc
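The stopwatch pattern from the interval-timer slides can be sketched in Python as follows. This is only an illustration: `time.perf_counter_ns()` stands in for `read_timer()`, the clock period of 1 ns is an assumption that follows from that choice, and the helper name `measure` is hypothetical.

```python
import time


def measure(func, *args):
    """Time one call of func(*args) with the stopwatch pattern.

    Returns (result, elapsed_seconds). time.perf_counter_ns() plays the
    role of read_timer(); it reports nanoseconds, so the clock period
    used for scaling is 1e-9 s (an assumption of this sketch).
    """
    clock_period = 1e-9                     # seconds per count (Tc)
    start_count = time.perf_counter_ns()    # x1
    result = func(*args)                    # portion being measured
    stop_count = time.perf_counter_ns()     # x2
    elapsed_time = (stop_count - start_count) * clock_period
    return result, elapsed_time


if __name__ == "__main__":
    total, seconds = measure(sum, range(1_000_000))
    print(f"sum = {total}, elapsed = {seconds:.6f} s")
```

The measured value still includes the cost of the two timer reads, so it is only trustworthy when the timed region is much longer than that cost.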
Software Timer
[Figure: a clock with period T'c feeds a prescaler (divide-by-n); the prescaler output, with period Tc, drives the CPU interrupt input, and each interrupt increments a software counter]
- Te = (x2 - x1) Tc

Quantization Errors
[Figure]

Quantization Error
- Timer resolution → quantization error
- Repeated measurements: n Tc < Te < (n+1) Tc
  - Te rounded to ± one clock tick
  - Completely unpredictable rounding
- → Want Tc to be as small as possible

Timer Rollover
- n-bit counter: count = [0, 2^n - 1]
- Rollover = transition from (2^n - 1) → 0
- If rollover occurs between start/stop events
  - Then count = (x2 - x1) < 0
- Check for count < 0
  - Measure again
  - Add 2^n to count

Timer Rollover
Time until rollover for an n-bit counter:

  Resolution (Tc)    n = 16     n = 32     n = 64
  10 ns              655 us     43 s       58.5 centuries
  1 us               65.5 ms    1.2 h      5,850 centuries
  100 us             6.55 s     5 days     585,000 centuries
  1 ms               1.1 min    50 days    5,850,000 centuries

Timer Overhead

    start_count = read_timer();
    /* portion of program to be measured */
    stop_count = read_timer();
    elapsed_time = (stop_count - start_count) * clock_period;

- To access timer
  - Min of 1 memory read → subroutine call
  - Min of 1 memory write → subroutine call
- Once at start, again at stop

Timer Overhead
[Figure: timeline of one timed measurement]

    Event begins; initiate read_timer()
      | T1
    Current time actually read
      | T2
    Event being measured begins
      | T3
    Event ends; initiate read_timer()
      | T4
    Current time actually read

Timer Overhead
- T1 = time to read counter value
- T2 = time to store counter value
- T3 = time of the event we are measuring
- T4 = time to read counter value
- T4 = T1

Timer Overhead
- Te = event time = T3
- But actually measured Tm = T2 + T3 + T4
- Te = Tm - (T2 + T4) = Tm - (T1 + T2)
- Timer overhead = Tovhd = (T1 + T2)
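The rollover check and the overhead correction described above can be combined in a few lines. The sketch below assumes at most one rollover between the two reads and assumes Tovhd has been estimated separately (for example by timing an empty region); the function names are hypothetical.

```python
def elapsed_counts(x1, x2, n_bits):
    """Counts between two reads of an n-bit counter, allowing for at
    most one rollover between the reads."""
    count = x2 - x1
    if count < 0:                 # counter wrapped from 2^n - 1 to 0
        count += 1 << n_bits      # add 2^n to the count
    return count


def event_time(x1, x2, n_bits, tc, t_ovhd=0.0):
    """Te = Tm - Tovhd, where Tm = (corrected count) * Tc."""
    tm = elapsed_counts(x1, x2, n_bits) * tc
    return tm - t_ovhd


def rollover_period(n_bits, tc):
    """Time for an n-bit counter with clock period tc to roll over."""
    return (1 << n_bits) * tc


if __name__ == "__main__":
    # A 16-bit counter read at 65530 and then at 4 has advanced 10 counts.
    print(elapsed_counts(65530, 4, 16))
    # A 32-bit counter with a 10 ns clock rolls over after about 43 s.
    print(f"{rollover_period(32, 10e-9):.1f} s")
```

If more than one rollover can occur during the measured region, the correction is no longer valid and a wider counter or a coarser clock is needed.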
Timer Overhead
- If Te >> Tovhd
  - Ignore the timer overhead
- If Te ≈ Tovhd
  - Measurements will be highly suspect
  - Potentially large variations in Tovhd
- Good rule of thumb
  - Te should be 100-1000x > Tovhd

Approximate Measures of Short Intervals
- How to measure an event that is shorter than the resolution of the clock?
- Cannot directly measure events with Te < Tc
- Overhead makes it hard to measure even when Te > n Tc, where n is a small integer

Approximate Measures of Short Intervals
[Figure: an event of duration Te placed against clock ticks of period Tc]
- Case 1: a clock tick occurs during the event → count + 1
- Case 2: no clock tick occurs during the event → count + 0

Approximate Measures of Short Intervals
- Bernoulli experiment
  - Outcome = +1 with probability p
  - Outcome = +0 with probability (1 - p)
- Equivalent to flipping a biased coin
- Repeat n times
  - Approximates a binomial distribution
  - Only approximate, since each measurement cannot be guaranteed to be independent
  - Usually close enough in practice

Approximate Measures of Short Intervals
- m = number of times Case 1 occurs (count + 1)
- n = total number of measurements
- Average duration is the ratio m/n scaled by the clock period: Te = (m/n) Tc
- Use confidence interval for proportions

Example
- Clock resolution = 10 us
- n = 8764 measurements
- m = 467 clock ticks counted
  - Case 1: 467
  - Case 2: 8297
- 95% confidence interval?

Example
  $$(c_1, c_2) = \frac{467}{8764} \mp 1.96 \sqrt{\frac{\frac{467}{8764}\left(1 - \frac{467}{8764}\right)}{8764}} = (0.0486, 0.0580)$$
- Scale by clock period = 10 us
- 95% chance that the measured event duration is in (0.49, 0.58) us

Profiling
- Overall view of program's execution-time behavior
- Fraction of total time spent in specific states
  - Fraction of time in each subroutine
  - Fraction of time in OS kernel
  - Fraction of time doing I/O
- Find bottlenecks, code hot-spots
  - Optimize those sections first
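The confidence interval for proportions used in the short-interval example above is easy to check numerically. The sketch below reproduces that example; the function name `proportion_interval` is hypothetical, and the z value (1.96 for 95%) is supplied by the caller rather than derived.

```python
import math


def proportion_interval(m, n, z):
    """Normal-approximation confidence interval for the proportion m/n."""
    p = m / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width


# Short-interval example: m = 467 ticks seen in n = 8764 measurements.
c1, c2 = proportion_interval(467, 8764, 1.96)
tc_us = 10.0  # clock period in microseconds
print(f"proportion: ({c1:.4f}, {c2:.4f})")                    # (0.0486, 0.0580)
print(f"duration:   ({c1 * tc_us:.2f}, {c2 * tc_us:.2f}) us")  # (0.49, 0.58) us
```

The same function applies to any proportion estimated from samples, including the fraction of profiling samples that fall in one subroutine.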
Statistical Sampling
- Select a random subset of a population
- Gather information on only this subset
- Extrapolate this information to overall population
- Results are a statistical summary with corresponding error probabilities

PC Sampling
[Figure: program timeline with periodic interrupts, each marked +1]
- Periodically interrupt program at fixed intervals
- Record appropriate state information in interrupt service routine
- Post-process to obtain overall profile

PC Sampling
- At each interrupt
  - Examine PC on return address stack
  - Use address map to translate this PC to subroutine i
  - Increment array element H[i]
- Example: PC = 4582
  - Address map
      0-1298:     Subr 1
      1299-3455:  Subr 2
      3456-5567:  Subr 3
      5568-9943:  Subr 4
  - Histogram counters: H[3] = H[3] + 1

PC Sampling
[Figure: histogram of sample counts for H[1] through H[12]; vertical axis from 0 to 140]

PC Sampling
- n total interrupts
- Post-processing step
  - H[i]/n = fraction of time executing in subroutine i
  - H[i] * (interrupt period) = (H[i]/n) * (total execution time) = time in subroutine i

PC Sampling
- This is a statistical process
  - Different counts each time the experiment is performed
- Infer behavior of entire program from small sample
- Apply confidence intervals to quantify precision of results

Example
- 40 us interrupt
- 36,128 interrupts in subroutine A
- Program runs for 10 seconds
- Time in this subroutine? 90% confidence interval
  - m = 36,128
  - n = 10 sec / 40 us = 250,000
  - p = m/n = 0.144512

Example
  $$(c_1, c_2) = 0.144512 \mp 1.645 \sqrt{\frac{0.144512\,(0.855488)}{250000}} = (0.143, 0.146)$$
- 90% chance that the program spent 14.3-14.6% of its time in subroutine A

Example
- 10 ms interrupt
- 12 interrupts in subroutine A
- n = 800 samples
- 8 seconds total execution time
- Time in this subroutine? 99% confidence interval
  - p = m/n = 0.015
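Stepping back from the examples, the address-map lookup and the post-processing step from the PC Sampling slides can be sketched as below. The address ranges are the ones shown on the slide; the function names and the use of a sorted list with `bisect` are assumptions of this sketch, not a description of any particular profiler.

```python
import bisect
from collections import Counter

# Address map from the slide: (first address, last address, subroutine index).
ADDR_MAP = [(0, 1298, 1), (1299, 3455, 2), (3456, 5567, 3), (5568, 9943, 4)]
_STARTS = [first for first, _, _ in ADDR_MAP]


def subroutine_for(pc):
    """Translate a sampled PC to a subroutine index, or None if unmapped."""
    k = bisect.bisect_right(_STARTS, pc) - 1
    if k >= 0 and pc <= ADDR_MAP[k][1]:
        return ADDR_MAP[k][2]
    return None


def profile(pc_samples, interrupt_period):
    """Return {subroutine: (fraction of samples, estimated time)}."""
    hist = Counter(subroutine_for(pc) for pc in pc_samples)   # H[i]
    n = len(pc_samples)
    return {i: (h / n, h * interrupt_period) for i, h in hist.items()}


if __name__ == "__main__":
    print(subroutine_for(4582))                    # 3, as on the slide
    print(profile([4582, 100, 4000, 6000], 0.01))
```

In a real profiler the samples would come from an interrupt service routine; here they are simply passed in as a list.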
Example
  $$(c_1, c_2) = 0.015 \mp 2.576 \sqrt{\frac{0.015\,(1 - 0.015)}{800}} = (0.0039, 0.0261)$$
- 99% chance that the program spent 31-210 ms in subroutine A
- A pretty wide range!
  - But only <3% of total execution time
  - Start optimizing somewhere else first

Reducing the Interval Size
- Use a lower confidence level
- Obtain more samples
  - Run program longer
    - May not be possible
  - Increase sample rate
    - May be fixed by system
    - Will increase overhead and perturbation
  - Run multiple times and add samples from each run

PC Sampling
[Figure: program timeline with interrupts, each marked +1]
- Interrupts must occur asynchronously w.r.t. any program events
- Samples must be independent of each other
- Else over/under-sample events synchronous with interrupt
- Periodic versus random sampling

Basic Block Counting
- Basic block
  - Sequence of instructions with no branches into or out of the block
  - When first instruction is executed, guaranteed that all instructions in block will be executed
  - Single entry, single exit

Basic Block Counting
- Generate a program profile by inserting additional instructions in each block
  - Increment a unique counter each time a block is entered
- Produces a histogram of program execution
- Can post-process to find instruction execution frequencies

Comparison
                   PC sampling                    Basic block counting
  Output           Statistical estimate           Exact count
  Overhead         Interrupt service routine      Extra instructions per block
  Perturbation     Randomly distributed           High
  Repeatability    Within statistical variance    Perfect

Event Tracing
- Profile shows overall frequency-of-execution behavior
  - Ignores time-ordering of events
- Program trace
  - Dynamic list of events generated by program
- Events = anything you want to instrument
  - Sequence of memory addresses
  - I/O blocks accessed
- Typically used to drive a simulator
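The difference between a basic-block profile and a trace can be seen in a small hand-instrumented loop. In this sketch each block increments its own counter (the profile) and also appends its label to an ordered list (the trace); the block labels and the function name are hypothetical, and real tools insert such code automatically rather than by hand.

```python
from collections import Counter


def run_instrumented(values):
    """Hand-instrumented loop: every basic block bumps its own counter
    (profile) and appends its label to an ordered list (trace)."""
    counts = Counter()
    trace = []

    def enter(block):
        counts[block] += 1       # basic block counting
        trace.append(block)      # event tracing keeps the order as well

    a = b = 0
    for i in values:
        enter("test")            # block that ends with the branch on i > 5
        if i > 5:
            enter("then")
            a += i
        else:
            enter("else")
            b += i
    return counts, trace, (a, b)


if __name__ == "__main__":
    counts, trace, result = run_instrumented([3, 9, 7])
    print(dict(counts))   # how often each block ran
    print(trace)          # the order in which the blocks ran
    print(result)
```

The counters give exact, repeatable frequencies but say nothing about ordering; the trace preserves ordering at the cost of storing one entry per event.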
Trace Generation
[Figure: application program (modified to generate trace) → compress → uncompress → trace consumer]

Trace Generation
[Figure: as above, with an additional path labeled "online trace consumption" between the application program and the trace consumer]

Trace Generation
[Figure, repeated on each of the following Trace Generation slides: source code → compiler → object code → processor → trace]
- Source-code modification
  - Allows precise control of what events are traced and what data is recorded
  - Typically a manual process

Trace Generation
- Software exceptions
  - HW forces an exception before each instruction
  - Exception routine decodes instruction
  - Store instr type, PC, operand addresses, etc.
  - "Trace" bit in many processors
  - Tremendous slowdown

Trace Generation
- Emulation
  - Make a system appear to be something else
  - Modify emulator to generate trace
  - E.g. Java Virtual Machine

Trace Generation
- Microcode modification
  - Modify instruction execution directly
  - Allows tracing of all instructions
    - Including operating system
  - Depends on access to lower levels of the processor
  - E.g. Transmeta Crusoe processor

Trace Generation
- Compiler modification
  - Insert trace code directly in object file
  - Requires access to the compiler itself
  - Alternative: write post-compilation binary editor/rewrite tool
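As a loose software analogy to the exception-per-instruction and emulator-based approaches above, Python's `sys.settrace` hook lets the interpreter call a handler before each source line of a function. The sketch below records which lines run, as offsets from the function's first line. It illustrates the idea only; it is not how hardware trace bits work, and the names `run_with_trace` and `update` are hypothetical.

```python
import sys


def run_with_trace(func, *args):
    """Run func(*args); return (result, trace), where trace lists the
    executed source lines of func as offsets from its first line."""
    trace = []
    code = func.__code__

    def local_tracer(frame, event, arg):
        if event == "line":
            trace.append(frame.f_lineno - code.co_firstlineno)
        return local_tracer

    def global_tracer(frame, event, arg):
        # Only trace frames that execute func's own code.
        if event == "call" and frame.f_code is code:
            return local_tracer
        return None

    previous = sys.gettrace()
    sys.settrace(global_tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)   # always restore the earlier hook
    return result, trace


def update(i, a, b):
    if i > 5:
        a = a + i
    else:
        b = b + i
    i = i + 1
    return i, a, b


if __name__ == "__main__":
    print(run_with_trace(update, 7, 0, 0))
    print(run_with_trace(update, 3, 0, 0))
```

Running a function this way is much slower than running it directly, which mirrors the "tremendous slowdown" noted for exception-based tracing.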
Trace Data
- Tracing generates a tremendous volume of data
  - Trace 100,000,000 instrs/sec
  - 16 bits of data per event
  - 190 Mbytes of data per second
  - 11 Gbytes per minute
- Huge perturbations
  - Due to tracing code
  - Time to store trace data

Trace Data Compression
[Figure: application program (modified to generate trace) → compress → uncompress → trace consumer]
- Standard compression algorithms as trace is written to disk
- Uncompress when reading
- Typical reduction 20-70%
- Tradeoff is compress/uncompress time

Online Trace Consumption
[Figure: application program (modified to generate trace) → online trace consumption → trace consumer]
- Use trace data as it is generated
  - Never stored on disk
- Multitasking may lead to non-deterministic behavior
- Before-and-after comparison tests
  - Repeatability issue
  - Difference due to change in system or change in trace?
  - Becomes statistical comparison with n runs

Abstract Execution
- Use higher-level information to intelligently compress trace info
- Two-step process
  - Compiler-style analysis to find critical subset of trace
    - Store only control flow information sufficient to reconstruct trace later
  - Produce trace-regeneration code for subsequent use of trace

Abstract Execution

    1. if (i > 5)
    2.    then a = a + i;
    3.    else b = b + i;
    4. i = i + 1;

[Figure: control flow graph in which node 1 (if (i > 5)) branches to node 2 (a = a + i) or node 3 (b = b + i), and both lead to node 4 (i = i + 1)]
- Trace will be either 1-2-4 or 1-3-4
- Store only 2 or 3
- Combine with compiler-generated control flow graph to regenerate trace
- Slowdown = 2-10x
- Compress = 10-100x

Trace Sampling
- Save only subsequences of overall trace
- Drive simulator with samples
  - Results should be statistically similar to driving with complete trace
- One sample = k consecutive events
- Sampling interval = P (period)
[Figure: trace timeline with samples of k events, one sample per period P]
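Trace sampling as described above, with one sample of k consecutive events per period P, amounts to slicing the trace at a fixed stride. The sketch below assumes the trace is already available as a list and that each sample starts at the beginning of its period; the function name is hypothetical.

```python
def sample_trace(trace, k, period):
    """Keep k consecutive events out of every `period` events."""
    if not 0 < k <= period:
        raise ValueError("need 0 < k <= period")
    return [trace[start:start + k] for start in range(0, len(trace), period)]


if __name__ == "__main__":
    events = list(range(10))             # stand-in for a real event trace
    print(sample_trace(events, 2, 4))    # [[0, 1], [4, 5], [8, 9]]
```

With k = 2 and P = 4, half of the trace is discarded; whether the simulator results stay statistically similar depends on how representative the retained samples are.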
SimPoint
- Find "representative" program samples
  - Match basic block execution frequencies
  - Clustering tool to automate process
- Perform detailed timing simulation on only these samples
  - Fast-forward (functional simulation) between samples
- [Sherwood et al., ASPLOS, 2002]

SimPoint
- Weight each sample's result by execution frequency to produce overall result
- Relatively small number (10s) of SimPoints produced ≈3% error in IPC on SPEC

SMARTS
- Uses systematic sampling
  - Fixed sample interval
- Apply statistical sampling techniques to determine j, k, P
[Figure: timeline alternating functional simulation and detailed simulation, with segment lengths j and k repeating with period P]

Indirect Ad Hoc Techniques
- Sometimes the desired metric cannot be measured directly
- Use your creativity to measure one thing and then derive/infer the desired value

Example - System Load
- What is system load?
  - Number of jobs in run queue?
  - Number of jobs actively time-sharing?
  - Fraction of time processor is not in idle loop?
  - Others?
- How to measure it?
  - Modify OS
  - PC sampling
  - Indirect?

Example
[Figure: over a fixed time T, a monitor program running alone reaches a count of n]
- Let system run for fixed time T
- Note value of counter

Example
[Figure: over the same time T, the monitor counts n when running alone, n/2 when sharing the processor with App 1, and n/3 when sharing it with App 1 and App 2]
- Let system run for fixed time T
- Compare value of loaded system monitor counter to unloaded system count value

Perturbation
- To obtain more information (higher resolution) → use more instrumentation points
- More instrumentation points → greater perturbation

Perturbation
- Computer performance measurement uncertainty principle
  - Accuracy is inversely proportional to resolution.
[Figure: accuracy (vertical axis, low to high) versus resolution (horizontal axis, increasing to high)]

Perturbation
- Superposition does not work here
  - Double instrumentation ≠ double impact on performance
- Non-linear
- Non-additive
  - Some instrumentation cancels out
  - Some multiplies impact
- No way to predict!

Instrumentation Code
- Changes memory access patterns
  - Generates additional load/store instructions
  - More frequent cache flushes and replacements
  - But may reduce set associativity conflicts
- Generates more I/O operations
- Will increase overall execution time
- Affects memory banking optimizations
- More time-sharing context switches
- Alters virtual memory paging behavior

Important Points
- Event types
  - Simple counts of primary event
  - Secondary events triggered by some primary event
  - Overall profiles

Important Points
- Measurement strategies
  - Event-driven
  - Tracing
  - Sampling
  - Indirect approaches

Important Points
- Interval timers
  - Stopwatch functionality
  - Rollover problem
  - Overhead
  - Quantization errors
  - Statistical measures of short intervals

Important Points
- Profiling
  - PC sampling
    - Statistical view
  - Basic block counting
    - Exact behavior
    - High overhead and perturbation

Important Points
- Trace generation
  - Source code modification
  - Force exceptions
  - Emulation
  - Microcode modification
  - Compiler modification
  - Object code editor
- Online trace consumption
- Trace sampling

Important Points
- Indirect measurements when all else fails
  - System load example
- Perturbations
  - Nobody likes them
  - Have to learn to live with them
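Returning to the system load example among the indirect measurements, the derivation from monitor counts to load can be sketched as follows. It assumes a single processor that is time-shared fairly, so that a monitor competing with j other jobs receives about 1/(j + 1) of the processor; both function names are hypothetical.

```python
import time


def monitor_count(duration_s):
    """Spin for duration_s seconds and return how many loop iterations ran."""
    count = 0
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        count += 1
    return count


def competing_jobs(unloaded_count, loaded_count):
    """Infer the number of competing jobs from the two monitor counts.

    With fair sharing, loaded_count ~= unloaded_count / (j + 1),
    so j ~= unloaded_count / loaded_count - 1.
    """
    return unloaded_count / loaded_count - 1.0


if __name__ == "__main__":
    n = monitor_count(0.1)   # run once on an otherwise idle system
    print(f"unloaded count over 0.1 s: {n}")
    # Counts of n/2 and n/3 correspond to one and two competing jobs.
    print(competing_jobs(900, 450), competing_jobs(900, 300))
```

The monitor itself is a load on the system, so this measurement perturbs exactly the quantity it is trying to estimate.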