CS.210 Computer Systems and Architecture <http://spider.science.strath.ac.uk/spider/spider/showClass.php?class=cs210> and CS.305 Computer Architecture <local.cis.strath.ac.uk/teaching/ug/classes/CS.305>

Assessing and Understanding Performance

Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made available by Dr Mary Jane Irwin, Penn State University.

Performance Metrics

Purchasing perspective: given a collection of machines, which has the
• best performance?
• least cost?
• best cost/performance?

Design perspective: faced with design options, which has the
• best performance improvement?
• least cost?
• best cost/performance?

Both require a basis for comparison and a metric for evaluation.

Our goal is to understand what factors in the architecture contribute to overall system performance, and the relative importance (and cost) of these factors.

Defining (Speed) Performance

Normally we are interested in reducing response time (aka execution time) – the time between the start and the completion of a task.
• Important to individual users.

Thus, to maximize performance, we need to minimize execution time:

    Performance_X = 1 / Execution time_X

If X is n times faster than Y, then

    Performance_X / Performance_Y = Execution time_Y / Execution time_X = n

Throughput – the total amount of work done in a given time.
• Important to data center managers.

Decreasing response time almost always improves throughput.

Performance Factors

We want to distinguish elapsed time from the time spent on our task.

CPU execution time (CPU time) – the time the CPU spends working on a task. It does not include time waiting for I/O or running other programs.

    CPU time_program = CPU clock cycles_program × Clock cycle time
or
    CPU time_program = CPU clock cycles_program / Clock rate

We can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program.

Review: Machine Clock Rate

Clock rate (MHz, GHz) is the inverse of the clock cycle time (clock period):

    Clock cycle time (one clock period) = 1 / Clock rate        Clock rate = 1 / Clock cycle time

    10 nsec clock cycle  => 100 MHz clock rate
     5 nsec clock cycle  => 200 MHz clock rate
     2 nsec clock cycle  => 500 MHz clock rate
     1 nsec clock cycle  =>   1 GHz clock rate
    500 psec clock cycle =>   2 GHz clock rate
    250 psec clock cycle =>   4 GHz clock rate

Clock Cycles per Instruction (CPI)

Not all instructions take the same amount of time to execute. One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction:

    Clock cycles_program = Instructions_program × Average clock cycles per instruction

Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute. It is a way to compare two different implementations of the same ISA.

    Instruction class:            A   B   C
    CPI for this class:           1   2   3

Effective CPI

The overall effective CPI is computed by looking at the different types of instructions and their individual cycle counts, and averaging:

    Overall effective CPI = Σ (CPI_i × IC_i)   for i = 1 to n

where
• IC_i is the count (percentage) of the number of instructions of class i executed,
• CPI_i is the (average) number of clock cycles per instruction for that instruction class,
• n is the number of instruction classes.

The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs.
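To make the formula concrete, here is a minimal Python sketch (the dictionary layout and variable names are just for this illustration) that computes the overall effective CPI for the instruction mix used in the worked example later in these notes.

```python
# Minimal sketch: overall effective CPI for a given instruction mix.
# "fraction" is the dynamic frequency IC_i of each class; "cpi" is CPI_i.
instruction_mix = {
    "ALU":    {"fraction": 0.50, "cpi": 1},
    "Load":   {"fraction": 0.20, "cpi": 5},
    "Store":  {"fraction": 0.10, "cpi": 3},
    "Branch": {"fraction": 0.20, "cpi": 2},
}

# Overall effective CPI = sum over all classes of CPI_i * IC_i
effective_cpi = sum(c["fraction"] * c["cpi"] for c in instruction_mix.values())
print(f"Effective CPI = {effective_cpi:.2f}")   # prints 2.20 for this mix
```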
THE Performance Equation

Our basic performance equation is then

    CPU time = Instruction count × CPI × Clock cycle time
or
    CPU time = Instruction count × CPI / Clock rate

These equations separate the three key factors that affect performance:
• We can measure the CPU execution time by running the program.
• The clock rate is usually given.
• We can measure the overall instruction count using profilers/simulators, without knowing all of the implementation details.
• CPI varies by instruction type and ISA implementation, for which we must know the implementation details.

Determinants of CPU Performance

    CPU time = Instruction count × CPI × Clock cycle time

                             Instruction count   CPI   Clock cycle
    Algorithm                        X            X
    Programming language             X            X
    Compiler                         X            X
    ISA                              X            X         X
    Processor organization                        X         X
    Technology                                              X

A Simple Example

    Op       Freq   CPI_i   Freq × CPI_i     Q1     Q2     Q3
    ALU      50%    1        0.5             0.5    0.5    0.25
    Load     20%    5        1.0             0.4    1.0    1.0
    Store    10%    3        0.3             0.3    0.3    0.3
    Branch   20%    2        0.4             0.4    0.2    0.4
                    Overall: 2.2             1.6    2.0    1.95

Q1: How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
    CPU time_new = 1.6 × IC × CC, so 2.2/1.6 means 37.5% faster.

Q2: How does this compare with using branch prediction to shave a cycle off the branch time?
    CPU time_new = 2.0 × IC × CC, so 2.2/2.0 means 10% faster.

Q3: What if two ALU instructions could be executed at once?
    CPU time_new = 1.95 × IC × CC, so 2.2/1.95 means 12.8% faster.
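The figures above are easy to check mechanically. The following sketch is a small Python script written for these notes (not taken from the textbook); it recomputes the effective CPI for the base machine and for each of Q1–Q3. Since the instruction count and clock cycle are unchanged in every case, the speedup is simply the ratio of the CPIs.

```python
# Recompute the effective CPIs and speedups from the worked example above.
# Base machine: (frequency, CPI) per instruction class.
base = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}

def effective_cpi(mix):
    """Effective CPI = sum of frequency_i * CPI_i over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())

cpi_base = effective_cpi(base)               # 2.2

# Q1: better data cache -> load CPI drops from 5 to 2 cycles.
q1 = dict(base, Load=(0.20, 2))
# Q2: branch prediction -> branch CPI drops from 2 to 1 cycle.
q2 = dict(base, Branch=(0.20, 1))
# Q3: two ALU instructions per cycle -> effective ALU CPI of 0.5.
q3 = dict(base, ALU=(0.50, 0.5))

for label, mix in [("Q1", q1), ("Q2", q2), ("Q3", q3)]:
    cpi_new = effective_cpi(mix)
    # Instruction count and clock cycle are unchanged, so speedup = CPI ratio.
    print(f"{label}: CPI = {cpi_new:.2f}, speedup = {cpi_base / cpi_new:.3f}")
# Q1: CPI 1.60, speedup 1.375 (37.5% faster)
# Q2: CPI 2.00, speedup 1.100 (10% faster)
# Q3: CPI 1.95, speedup 1.128 (~12.8% faster)
```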
Comparing and Summarizing Performance

How do we summarize the performance of a benchmark set with a single number?

The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM):

    AM = (1/n) × Σ Time_i   for i = 1 to n

where Time_i is the execution time for the i-th program of a total of n programs in the workload. A smaller mean indicates a smaller average execution time and thus improved performance.

The guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment: the version of the operating system, compiler settings, the input set used, and the specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.).

SPEC Benchmarks (www.spec.org)

Integer benchmarks:
    gzip      compression
    vpr       FPGA place & route
    gcc       GNU C compiler
    mcf       Combinatorial optimization
    crafty    Chess program
    parser    Word processing program
    eon       Computer visualization
    perlbmk   perl application
    gap       Group theory interpreter
    vortex    Object-oriented database
    bzip2     compression
    twolf     Circuit place & route

FP benchmarks:
    wupwise   Quantum chromodynamics
    swim      Shallow water model
    mgrid     Multigrid solver in 3D fields
    applu     Parabolic/elliptic PDE
    mesa      3D graphics library
    galgel    Computational fluid dynamics
    art       Image recognition (neural networks)
    equake    Seismic wave propagation simulation
    facerec   Facial image recognition
    ammp      Computational chemistry
    lucas     Primality testing
    fma3d     Crash simulation (finite-element method)
    sixtrack  Nuclear physics accelerator
    apsi      Pollutant distribution

Example SPEC Ratings

[Chart of example SPEC ratings omitted.]

Other Performance Metrics

Power consumption – especially in the embedded market, where battery life (and cooling) is important. For power-limited applications, the most important metric is energy efficiency.

Other Performance Metrics – (Native) MIPS, and What Is Wrong with Them

The dangers of using metrics other than time in performance measurement can be shown by looking at several popular alternatives. One such alternative is MIPS – Millions of Instructions Per Second:

    MIPS = Instruction count / (Execution time × 10^6) = Clock rate / (CPI × 10^6)

The second form is sometimes convenient, since the clock rate is fixed for a machine and CPI is usually a small number, unlike instruction count or execution time. Relating MIPS to time:

    Execution time = Instruction count / (MIPS × 10^6)

The Problem With MIPS

The problem with using MIPS as a measure of comparison is threefold:
• MIPS is dependent on the instruction set, making it difficult to compare MIPS of computers with different instruction sets;
• MIPS varies between programs on the same computer; and
• most importantly, MIPS can vary inversely to performance!

The classic example of the last case is the MIPS rating of a machine with optional floating-point hardware. Machines with the option yield faster-executing programs yet have a lower MIPS rating.
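To make the floating-point example concrete, the sketch below uses invented numbers (the clock rate, instruction counts, and CPIs are purely illustrative) to show how the machine that finishes the program sooner can end up with the lower MIPS rating.

```python
# Toy illustration (invented numbers): MIPS can vary inversely to performance.
# Machine A has floating-point hardware: fewer instructions, but FP ops take many cycles.
# Machine B emulates FP in software: many simple integer instructions, each fast.
clock_rate = 500e6   # assume 500 MHz for both machines

# (instruction count, effective CPI)
machine_a = (50e6, 10.0)    # FP hardware: 50M instructions at CPI 10
machine_b = (500e6, 2.0)    # software FP: 500M instructions at CPI 2

for name, (ic, cpi) in [("A (FP hardware)", machine_a), ("B (software FP)", machine_b)]:
    exec_time = ic * cpi / clock_rate        # CPU time = IC * CPI / clock rate
    mips = ic / (exec_time * 1e6)            # MIPS = IC / (execution time * 10^6)
    print(f"{name}: time = {exec_time:.2f} s, MIPS = {mips:.0f}")
# A: time = 1.00 s, MIPS = 50   <- faster program, lower MIPS rating
# B: time = 2.00 s, MIPS = 250  <- slower program, higher MIPS rating
```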
Peak MIPS

Beware of so-called peak MIPS. This type of rating is obtained by choosing an instruction mix that minimises the CPI, even if that instruction mix is totally impractical. For instance:
• a program composed entirely of arithmetic and logic operations, with no jumps, branches, or loads/stores;
• or, as in a famous case, a program comprising only NOPs (No OPerations).

In other words: peak MIPS is a level of performance that will never be attained ;-)

Relative MIPS

MIPS can fail to give a true picture of performance in that it does not track execution time. An alternative type of MIPS rating is relative MIPS – as opposed to native MIPS – derived by using a particular machine as a reference point:

    Relative MIPS = (Time_reference / Time_unrated) × MIPS_reference

where
    Time_reference  = execution time of a program on the reference machine
    Time_unrated    = execution time of the same program on the machine to be rated
    MIPS_reference  = agreed-upon MIPS rating of the reference machine

The advantage of this form of MIPS is small, since the execution time, program, and program input must still be known to have meaningful information.

Summary: Evaluating ISAs

Design-time metrics:
• Can it be implemented? In how long, and at what cost?
• Can it be programmed? Ease of compilation?

Static metrics:
• How many bytes does the program occupy in memory?

Dynamic metrics:
• How many instructions are executed? How many bytes does the processor fetch to execute the program?
• How many clocks are required per instruction (CPI)?
• How "lean" a clock is practical?

Best metric: time to execute the program! It depends on the instruction set, the processor organization, and compilation techniques – that is, on instruction count, CPI, and cycle time.

Fallacies and Pitfalls

Pitfall: expecting the improvement of one aspect of a machine to increase performance by an amount proportional to the size of the improvement.

Example: suppose a program takes 100 seconds to run and multiply operations account for 80 seconds of this time. How much do you need to improve the speed of multiplication to make the program run five times faster?

A: Using Amdahl's Law:

    Execution time after improvement =
        (Execution time affected by improvement / Amount of improvement) + Execution time unaffected

so, with an improvement factor n for multiplication,

    Execution time after improvement = 80 seconds / n + (100 − 80) seconds

To get five times faster, the new execution time must be 20 seconds:

    20 seconds = 80 seconds / n + 20 seconds
    0 = 80 seconds / n

i.e. there is no amount by which we can enhance multiply to achieve a fivefold improvement in execution time! (A short numerical sketch of this example follows at the end of this section.)

Making the common case fast will tend to enhance performance better than optimising the rare case.

Fallacies and Pitfalls (continued)

Pitfall: comparing computers using only one or two of the three performance metrics: clock rate, CPI, and instruction count.

Pitfall: using peak performance to compare machines.

Fallacy: synthetic benchmarks predict performance. Synthetic benchmarks are small artificial programs that attempt to represent the execution frequency of statements found in a larger set of benchmarks or in real-world programs; Whetstone and Dhrystone are examples. Since these are not natural programs, they can (be used to) distort performance statistics by, e.g.:
• compilers discarding large sections of the code;
• compilers targeting other optimisation 'opportunities' specifically for a benchmark and hence artificially inflating the performance statistics – e.g. a 20% to 30% improvement from a string-copy 'optimisation' for Dhrystone that could not be applied in over 99% of normal programs!
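As promised in the Amdahl's Law example above, here is a small numerical sketch (the helper function and the chosen improvement factors are just for illustration) showing that speeding up only the multiply portion can never make this program five times faster.

```python
# Minimal sketch of the Amdahl's Law pitfall above: the program takes 100 s,
# of which 80 s is multiplication. How fast must multiply get for a 5x speedup?
total_time = 100.0       # seconds
multiply_time = 80.0     # seconds spent in multiply operations
unaffected = total_time - multiply_time   # 20 s the improvement cannot touch

def time_after_improvement(n):
    """Execution time after speeding up multiply by a factor of n."""
    return multiply_time / n + unaffected

for n in (2, 4, 10, 100, 1_000_000):
    t = time_after_improvement(n)
    print(f"multiply {n:>9}x faster -> {t:6.2f} s, overall speedup {total_time / t:.2f}x")
# Even with an infinitely fast multiplier the 20 s of unaffected work remains,
# so the overall speedup is bounded by 100/20 = 5x and is never actually reached.
```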
Concluding Remarks

The task a computer designer faces is a complex one: determine what attributes are important for a new machine, then design a machine to maximise performance while staying within cost constraints.

Performance can be measured as throughput or as response time; which one matters depends on the environment/application and should be borne in mind.

Amdahl's Law is a valuable tool for determining what performance improvement an architectural enhancement may give. Knowing which cases are the most frequent is critical to improving performance.

Based on empirical studies of instruction sets, tradeoffs can be made by deciding which instructions are the most important and which cases to try to make fast.

Computer designs will always be measured by cost and performance, and finding the best balance will always be the art of computer design, just as in any engineering task.