CMSC 611: Advanced Computer Architecture Performance & Benchmarks Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / © 2003 Elsevier Science 2 Calculation of CPU Time CPU time = Instructions Clock cycles Seconds ´ ´ Program Instruction Clock cycle CPU time = Instruction count ´ CPI´ Clock cycle time Or Instruction count ´ CPI CPU time = Clock rate Component of performance CPU execution time for a program Instruction count Clock cycles per instructions (CPI) Clock cycle time Units of measure Seconds for the program Instructions executed for the program Average number of clock cycles/instruction Seconds per clock cycle 3 CPU Time (Cont.) • CPU execution time can be measured by running the program • The clock cycle is usually published by the manufacture • Measuring the CPI and instruction count is not trivial – Instruction counts can be measured by: software profiling, using an architecture simulator, using hardware counters on some architecture – The CPI depends on many factors including: processor structure, memory system, the mix of instruction types and the implementation of these instructions 4 CPU Time (Cont.) • Designers sometimes uses the following formula: n CPU clock cycles = åCPIi ´ Ci i =1 Where: Ci is the count of number of instructions of class i executed CPIi is the average number of cycles per instruction for that instruction class n is the number of different instruction classes 5 Example Suppose we have two implementation of the same instruction set architecture. Machine “A” has a clock cycle time of 1 ns and a CPI of 2.0 for some program, and machine “B” has a clock cycle time of 2 ns and a CPI of 1.2 for the same program. Which machine is faster for this program and by how much? Both machines execute the same instructions for the program. Assume the number of instructions is “I”, CPU clock cycles (A) = I 2.0 CPU clock cycles (B) = I 1.2 The CPU time required for each machine is as follows: CPU time (A) = CPU clock cycles (A) Clock cycle time (A) = I 2.0 1 ns = 2 I ns CPU time (B) = CPU clock cycles (B) Clock cycle time (B) = I 1.2 2 ns = 2.4 I ns Therefore machine A will be faster by the following ratio: CPU Performanc e (A) CPU time (B) 2.4 ´ I ns = = = 1.2 CPU Performanc e (B) CPU time (A) 2 ´ I ns 6 Comparing Code Segments A compiler designer is trying to decide between two code sequences for a particular machine. The hardware designers have supplied the following facts: Instruction class A B C CPI for this instruction class 1 2 3 For a particular high-level language statement, the compiler writer is considering two code sequences that require the following instruction counts: Code sequence 1 2 Instruction count for instruction class A B C 2 1 2 4 1 1 Which code sequence executes the most instructions? Which will be faster? What is the CPI for each sequence? Answer: Sequence 1: executes 2 + 1 + 2 = 5 instructions Sequence 2: executes 4 + 1 + 1 = 6 instructions 7 Comparing Code Segments n Using the formula: Sequence 1: Sequence 2: CPU clock cycles = åCPIi ´ Ci i =1 CPU clock cycles = (2 1) + (1 2) + (2 3) = 10 cycles CPU clock cycles = (4 1) + (1 2) + (1 3) = 9 cycles Therefore Sequence 2 is faster although it executes more instructions Using the formula: Sequence 1: Sequence 2: CPI = CPU clock cycles Instruction count CPI = 10/5 = 2 CPI = 9/6 = 1.5 Since Sequence 2 takes fewer overall clock cycles but has more instructions it must have a lower CPI 8 The Role of Performance • Hardware performance is a key to the effectiveness of the entire system • Performance has to be measured and compared to evaluate designs • To optimize the performance, major affecting factors have to be known • For different types of applications – different performance metrics may be appropriate – different aspects of a computer system may be most significant • Instructions use and implementation, memory hierarchy and I/O handling are among the factors that affect the performance 9 Calculation of CPU Time Instruction count ´ CPI CPU time = Clock rate Instr. Count CPI Program X Compiler X X Instruction Set X X Organization X Technology Clock Rate X X n CPU clock cycles = åCPIi ´ Ci i =1 Where: Ci is the count of number of instructions of class i executed CPIi is the average number of cycles per instruction for that instruction class n is the number of different instruction classes 10 Important Equations (so far) 1 Performance = Execution time Performance (B) Time (A) Speedup = = Performance (A) Time (B) Instructions Cycles Seconds CPU time = ´ ´ Program Instruction Cycle n CPU clock cycles = åCPIi ´ Instructionsi i=1 11 Amdahl’s Law The performance enhancement possible with a given improvement is limited by the amount that the improved feature is used Execution time after improvement = Execution time affected by the improvement Amount of improvement + Execution time unaffected • A common theme in Hardware design is to make the common case fast • Increasing the clock rate would not affect memory access time • Using a floating point processing unit does not speed integer ALU operations Example: Floating point instructions improved to run 2X; but only 34% of actual instructions are floating point Exec-Timenew = Exec-Timeold x (0.66 + .34/2) = 0.83 x Exec-Timeold Speedupoverall = Exec-Timeold / Exec-Timenew = 1/0.83 = 1.205 12 Ahmdal’s Law Visually Timeold Timenew 13 Ahmdal’s Law for Speedup ( Timeold = Timeold * Fractionunchanged + Fractionenhanced ) æ Fractionenhanced ö Timenew = Timeold * çFractionunchanged + ÷ Speedupenhanced ø è Speedupoverall = Timeold = Timenew Timeold æ Fractionenhanced ö Timeold * çFractionunchanged + ÷ Speedupenhanced ø è 1 = Fractionenhanced Fractionunchanged + Speedupenhanced Speedupoverall = 1 ( ) 1- Fractionenhanced + Fractionenhanced Speedupenhanced 14 Performance Variations • • • • Performance is dependent on workload Task dependent Need a measure of workload Best = run your program – Often cannot • Benchmark = “typical” workload – Standardized for comparison Synthetic Benchmarks • Synthetic benchmarks are artificial programs that are constructed to match the characteristics of large set of programs • Whetstone (scientific programs), Dhrystone (systems programs), LINPACK (linear algebra), … Synthetic Benchmark Drawbacks 1. They do not reflect the user interest since they are not real applications 2. They do not reflect real program behavior (e.g. memory access pattern) 3. Compiler and hardware can inflate the performance of these programs far beyond what the same optimization can achieve for real-programs 17 Application Benchmarks • Real applications typical of expected workload • Applications and mix important The SPEC Benchmarks • System Performance Evaluation Cooperative • Suite of benchmarks • • • • – Created by a set of companies to improve the measurement and reporting of CPU performance SPEC2006 is the latest suite SPECint = 12 integer programs SPECfloat = 17 floating point programs Since SPEC requires running applications on real hardware, the memory system has a significant effect on performance – Reported with results Performance Reports Hardware Model number CPU FPU (floating point) Number of CPU Cache size per CPU Memory Disk subsystem Network interface Powerstation 550 41.67-MHz POWER 4164 Integrated 1 64K data/8k instruction 64 MB 2 400-MB SCSI N/A Software OS type and revision Compiler revision Other software File system type Firmware level AIX Ver. 3.1.5 AIX XL C/6000 Ver. 1.1.5 AIX XL Fortran Ver. 2.2 None AIX N/A System Tuning parameters Background load System state None None Multi-user (single-user login) Guiding principle is reproducibility (report environment & experiments setup) The SPEC Benchmarks Execution time on SUN SPARCstation 10/40 SPEC ratio = Execution time on the measure machine • Bigger numeric values of the SPEC ratio indicate faster machine • 1997 “historical” reference machine SPEC95 for Pentium and Pentium Pro 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1 1 0 0 50 100 150 Clock rate (MHz) 200 Pentium Pentium Pro 250 50 100 150 Clock rate (MHz) 200 250 Pentium Pentium Pro • The performance measured may be different on other Pentium-based hardware with different memory system and using different compilers – At the same clock rate, the SPECint95 measure shows that Pentium Pro is 1.4-1.5 times faster while the SPECfp95 shows that it is 1.7-1.8 times faster – When the clock rate is increased by a certain factor, the processor performance increases by a lower factor Comparing & Summarizing Performance Program 1 (seconds) Program 2 (seconds) Total time (seconds) Computer A 1 1000 1001 Computer B 10 100 110 • Wrong summary can present a confusing picture – A is 10 times faster than B for program 1 – B is 10 times faster than A for program 2 • Total execution time is a consistent summary measure • Relative execution times for the same workload – Assuming that programs 1 and 2 are executing for the same number of times on computers A and B CPU Performance (B) Total execution time (A) 1001 = = = 9.1 CPU Performance (A) Total execution time (B) 110 Execution time is the only valid and unimpeachable measure of performance Effect of Compilation 800 700 600 500 400 300 200 100 0 gcc espresso spice doduc nasa7 li eqntott matrix300 fpppp tomcatv Benchmark Compiler Enhanced compiler App. and arch. specific optimization can dramatically impact performance Performance Summary 1n Arithmetic Mean (AM) = å Execution_ Time i n i =1 n Weighted Arithmetic Mean (WAM) = å wi ´ Execution_Timei i =1 Where: n is the number of programs executed wi is a weighting factor that indicates the frequency of executing program i n with å w i = 1 and i =1 0 £ wi £ 1 • Weighted arithmetic means summarize performance while tracking exec. time • Never use AM for normalizing time relative to a reference machine Time on A Time on B Program 1 Program 2 AM of normalized time AM of time 1 1000 10 100 500.5 55 Norm. to A Norm. to B A B A B 1 10 0.1 1 1 0.1 10 1 1 5.05 5.05 1 1 0 . 1 1 9.1 1 Price-Performance Metric SPECbase CINT2000 • Different results are obtained for other benchmarks, e.g. SPEC CFP2000 • With the exception of the Sunblade price-performance metrics were consistent with performance SPEC CINT2000 per $1000 in price Prices reflects those of July 2001