Lecture 3
Benchmarks and Performance Metrics

Performance CS510 Computer Architectures Lecture 3 - 1

Measurement Tools
• Benchmarks, Traces, Mixes
• Cost, Delay, Area, Power Estimation
• Simulation (many levels)
  – ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental Laws

The Bottom Line: Performance (and Cost)

Plane              Time (DC-Paris)   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours         610 mph    470          286,700
BAC/Sud Concorde   3.0 hours         1350 mph   132          178,200

(pmph = passenger-miles per hour, i.e., Passengers x Speed)

• Time to run the task (ExTime)
  – Execution time, response time, latency
• Tasks per day, hour, week, sec, ns, ... (Performance)
  – Throughput, bandwidth

The Bottom Line: Performance (and Cost)
“X is n times faster than Y” means:

$$n = \frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$

Performance Terminology
“X is n% faster than Y” means:

$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = 1 + \frac{n}{100}$$

$$n = \frac{100 \times (\text{Performance}(X) - \text{Performance}(Y))}{\text{Performance}(Y)}$$

Example
Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X?
$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{15}{10} = \frac{1.5}{1.0} = \frac{\text{Performance}(X)}{\text{Performance}(Y)}$$

$$n = \frac{100 \times (1.5 - 1.0)}{1.0} = 50$$

X is 50% faster than Y.

Programs to Evaluate Processor Performance
• (Toy) Benchmarks
  – 10~100-line program
  – e.g., sieve, puzzle, quicksort
• Synthetic Benchmarks
  – Attempt to match average frequencies of real workloads
  – e.g., Whetstone, Dhrystone
• Kernels
  – Time-critical excerpts of real programs
  – e.g., Livermore loops
• Real programs
  – e.g., gcc, spice

Benchmarking Games
• Differing configurations used to run the same workload on two systems
• Compiler wired to optimize the workload
• Workload arbitrarily picked
• Very small benchmarks used
• Benchmarks manually translated to optimize performance

Common Benchmarking Mistakes
• Only average behavior represented in test workload
• Ignoring monitoring overhead
• Not ensuring same initial conditions
• “Benchmark Engineering”
  – particular optimization
  – different compilers or preprocessors
  – runtime libraries

SPEC: System Performance Evaluation Cooperative
• First Round 1989
  – 10 programs yielding a single number
• Second Round 1992
  – SpecInt92 (6 integer programs) and SpecFP92 (14 floating point programs)
  – VAX-11/780
• Third Round 1995
  – Single flag setting for all programs; new set of programs “benchmarks useful for 3 years”
  – SPARCstation 10 Model 40

SPEC First Round
• One program: 99% of time in single line of code
• New front-end compiler could improve dramatically

[Figure: bar chart of SPEC Perf (y-axis, 0 to 800) by Benchmark (x-axis): gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]

How to Summarize Performance
• Arithmetic Mean (weighted arithmetic mean)
  – tracks execution time: $\sum_i T_i / n$ or $\sum_i W_i T_i$
• 
Harmonic Mean (weighted harmonic mean) of execution rates (e.g., MFLOPS)
  – tracks execution time: $n / \sum_i (1/R_i)$ or $1 / \sum_i (W_i/R_i)$, with weights summing to 1
• Normalized execution time is handy for scaling performance
• But do not take the arithmetic mean of normalized execution time, use the geometric mean $\left(\prod_i R_i\right)^{1/n}$, where $R_i = 1/T_i$

Comparing and Summarizing Performance

                    Computer A   Computer B   Computer C
P1 (secs)           1            10           20
P2 (secs)           1,000        100          20
Total time (secs)   1,001        110          40

For program P1, A is 10 times faster than B. For program P2, B is 10 times faster than A, and so on. The relative performance of the computers is unclear from total execution times alone.

Summary Measure
Arithmetic Mean:

$$\frac{1}{n} \sum_{i=1}^{n} \text{Execution Time}_i$$

Harmonic Mean (when performance is expressed as rates):

$$\frac{n}{\sum_{i=1}^{n} 1/\text{Rate}_i}, \qquad \text{Rate}_i = f(1/\text{Execution Time}_i)$$

Good, if programs are run equally in the workload.

Unequal Job Mix
Relative Performance
• Weighted Execution Time
  – Weighted Arithmetic Mean: $\sum_{i=1}^{n} \text{Weight}_i \times \text{Execution Time}_i$
  – Weighted Harmonic Mean: $1 / \sum_{i=1}^{n} (\text{Weight}_i / \text{Rate}_i)$
• Normalized Execution Time to a reference machine
  – Arithmetic Mean
  – Geometric Mean: $\sqrt[n]{\prod_{i=1}^{n} \text{Execution Time Ratio}_i}$, with each ratio normalized to the reference machine

Weighted Arithmetic Mean

$$\text{WAM}(i) = \sum_{j=1}^{n} W(i)_j \times \text{Time}_j$$

             A          B        C       W(1)    W(2)    W(3)
P1 (secs)    1.00       10.00    20.00   0.50    0.909   0.999
P2 (secs)    1,000.00   100.00   20.00   0.50    0.091   0.001
WAM(1)       500.50     55.00    20.00
WAM(2)       91.91      18.19    20.00
WAM(3)       2.00       10.09    20.00

For example, WAM(1) for A = 1.0 x 0.5 + 1,000 x 0.5 = 500.50.

Normalized Execution Time

       A          B        C
P1     1.00       10.00    20.00
P2     1,000.00   100.00   20.00

$$\text{Geometric Mean} = \sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$

                  Normalized to A        Normalized to B        Normalized to C
                  A     B      C         A      B     C         A       B      C
P1                1.0   10.0   20.0      0.1    1.0   2.0       0.05    0.5    1.0
P2                1.0   0.1    0.02      10.0   1.0   0.2       50.0    5.0    1.0
Arithmetic mean   1.0   5.05   10.01     5.05   1.0   1.1       25.03   2.75   1.0
Geometric mean    1.0   1.0    0.63      1.0    1.0   0.63      1.58    1.58   1.0
Total time        1.0   0.11   0.04      9.1    1.0   0.36      25.03   2.75   1.0

Disadvantages of Arithmetic Mean
Performance varies depending on the reference machine.

                  Normalized to A        Normalized to B        Normalized to C
                  A     B      C         A      B     C         A       B      C
P1                1.0   10.0   20.0      0.1    1.0   2.0       0.05    0.5    1.0
P2                1.0   0.1    0.02      10.0   1.0   0.2       50.0    5.0    1.0
Arithmetic mean   1.0   5.05   10.01     5.05   1.0   1.1       25.03   2.75   1.0

• Normalized to A: B is 5 times slower than A, and C is slowest
• Normalized to B: A is 5 times slower than B
• Normalized to C: C is fastest

The Pros and Cons of Geometric Means
• Independent of running times of the individual programs
• Independent of the reference machines
• Do not predict execution time
  – The geometric mean says that A and B perform the same, but their total execution times are equal only when P1 runs 100 times for every occurrence of P2:
    1 (P1) x 100 + 1,000 (P2) x 1 = 10 (P1) x 100 + 100 (P2) x 1

                  Normalized to A        Normalized to B        Normalized to C
                  A     B      C         A      B     C         A       B      C
P1                1.0   10.0   20.0      0.1    1.0   2.0       0.05    0.5    1.0
P2                1.0   0.1    0.02      10.0   1.0   0.2       50.0    5.0    1.0
Geometric mean    1.0   1.0    0.63      1.0    1.0   0.63      1.58    1.58   1.0

Amdahl's Law
Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime w/o } E}{\text{ExTime w/ } E} = \frac{\text{Performance w/ } E}{\text{Performance w/o } E}$$

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:

$$\text{ExTime}_E = \text{ExTime} \times \left[ (1 - \text{Fraction}_E) + \frac{\text{Fraction}_E}{\text{Speedup}_E} \right]$$

$$\text{Speedup} = \frac{\text{ExTime}}{\text{ExTime}_E} = \frac{1}{(1 - \text{Fraction}_E) + \dfrac{\text{Fraction}_E}{\text{Speedup}_E}} = \frac{1}{(1 - F) + F/S}$$

Amdahl's Law
Floating-point instructions are improved to run 2 times faster (a 100% improvement), but only 10% of actual instructions are
FP

$$\text{Speedup} = \frac{1}{(1 - F) + F/S} = \frac{1}{(1 - 0.1) + 0.1/2} = \frac{1}{0.95} = 1.053$$

That is only a 5.3% improvement.

Corollary (Amdahl): Make the Common Case Fast
• All instructions require an instruction fetch, only a fraction require a data fetch/store
  – Optimize instruction access over data access
• Programs exhibit locality
  – Spatial Locality
  – Temporal Locality
• Access to small memories is faster
  – Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories

Storage hierarchy (smallest and closest first): Reg's, Cache, Memory, Disk / Tape

Locality of Access
Spatial Locality: data items whose addresses are close together are likely to be accessed close together in time.
Temporal Locality: recently referenced data is likely to be referenced again in the near future.

Rule of Thumb
• The simple case is usually the most frequent and the easiest to optimize!
• Do simple, fast things in hardware (faster) and be sure the rest can be handled correctly in software

Metrics of Performance

Level                                Metric
Application                          Answers per month; Operations per second
Programming Language
Compiler
ISA                                  (millions) of Instructions per second: MIPS;
                                     (millions) of (FP) operations per second: MFLOP/s
Datapath, Control                    Megabytes per second
Function Units,                      Cycles per second (clock rate)
Transistors, Wires, Pins

Aspects of CPU Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

               Inst Count   CPI   Clock Rate
Program        X
Compiler       X            (X)
Inst. Set      X            X
Organization                X     X
Technology                        X

Marketing Metrics

$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Time} \times 10^6} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^6}$$

• Machines with different instruction sets?
• Programs with different instruction mixes?
  – Dynamic frequency of instructions
• Not correlated with performance

$$\text{MFLOP/s} = \frac{\text{FP Operations}}{\text{Time} \times 10^6}$$

• Machine dependent
• Often not where time is spent

Normalized FP operation weights:

Operation                 Weight
add, sub, compare, mult   1
divide, sqrt              4
exp, sin, ...             8

Cycles Per Instruction
Average cycles per instruction:

$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Cycles}}{\text{Instruction Count}}$$

$$\text{CPU time} = \text{Cycle Time} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$$

Instruction Frequency:

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i, \quad \text{where } F_i = \frac{I_i}{\text{Instruction Count}}$$

Invest resources where time is spent!

Organizational Trade-offs
• Levels (top to bottom): Application; Programming Language; Compiler; ISA; Datapath, Control; Function Units; Transistors, Wires, Pins
• Quantities traded off across these levels: Instruction Mix, CPI, Cycle Time

Example: Calculating CPI
Base Machine (Reg / Reg), Typical Mix

Op       Freq   CPI(i)   CPI   (% Time)
ALU      50%    1        .5    (33%)
Load     20%    2        .4    (27%)
Store    10%    2        .2    (13%)
Branch   20%    2        .4    (27%)
Total                    1.5

Example
Some of the LD instructions can be eliminated by having an R/M type ADD instruction [ADD R1, X].
• Add register / memory operations: R/M
  – One source operand in memory
  – One source operand in register
  – Cycle count of 2
• Branch cycle count to increase to 3
What fraction of the loads must be eliminated for this to pay off?
Base Machine (Reg / Reg), Typical Mix

Op       Freq_i   CPI_i
ALU      50%      1
Load     20%      2
Store    10%      2
Branch   20%      2

Example Solution
Exec Time = Instr Cnt x CPI x Clock

           Old                         New
Op         Freq_i   CPI_i   CPI        Freq_i   CPI_i   CPI_NEW
ALU        .50      1       .5         .5 - X   1       .5 - X
Load       .20      2       .4         .2 - X   2       .4 - 2X
Store      .10      2       .2         .1       2       .2
Branch     .20      2       .4         .2       3       .6
Reg/Mem                                X        2       2X
Total      1.00             1.5        1 - X            (1.7 - X)/(1 - X)

CPI_NEW must be normalized to the new instruction frequency: the new cycle contributions sum to 1.7 - X over only 1 - X instructions.

$$\text{Instr Cnt}_{Old} \times \text{CPI}_{Old} \times \text{Clock} = \text{Instr Cnt}_{New} \times \text{CPI}_{New} \times \text{Clock}$$

$$1.00 \times 1.5 = (1 - X) \times \frac{1.7 - X}{1 - X}$$

$$1.5 = 1.7 - X$$

$$X = 0.2$$

X = 0.2 equals the entire load frequency (20%), so all LOADs must be eliminated just to break even, and the change cannot be a win unless every load goes away.

Fallacies and Pitfalls
• MIPS is an accurate measure for comparing performance among computers
  – dependent on the instruction set
  – varies between programs on the same computer
  – can vary inversely to performance
• MFLOPS is a consistent and useful measure of performance
  – dependent on the machine and on the program
  – not applicable outside floating-point performance
  – the set of floating-point operations is not consistent across machines
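
The break-even result in the Example Solution, and the pitfall that MIPS depends on the instruction set, can be checked with a short Python sketch. This is only an illustration under stated assumptions: the function names are hypothetical, and the 100 MHz clock rate is an arbitrary value chosen for the MIPS figures, since the lecture gives no clock rate. Frequencies for the new machine are kept relative to the old instruction count, so they sum to 1 - X as in the solution table.

```python
from fractions import Fraction

# Base machine mix from the lecture's example: op -> (frequency, cycles).
BASE_MIX = {
    "ALU": (Fraction(50, 100), 1),
    "Load": (Fraction(20, 100), 2),
    "Store": (Fraction(10, 100), 2),
    "Branch": (Fraction(20, 100), 2),
}

# Assumption: hypothetical 100 MHz clock, not taken from the lecture.
CLOCK_HZ = 100 * 10**6


def reg_mem_mix(x):
    """Mix after a fraction x of all old instructions (loads) is folded
    into 2-cycle Reg/Mem ops. Each eliminated load also removes the ALU
    op it fed, and branches slow down to 3 cycles.
    Frequencies are relative to the OLD instruction count (sum = 1 - x)."""
    return {
        "ALU": (Fraction(50, 100) - x, 1),
        "Load": (Fraction(20, 100) - x, 2),
        "Store": (Fraction(10, 100), 2),
        "Branch": (Fraction(20, 100), 3),
        "Reg/Mem": (x, 2),
    }


def relative_time(mix):
    """Cycles per OLD instruction: proportional to execution time
    when the clock is unchanged."""
    return sum(freq * cycles for freq, cycles in mix.values())


def cpi(mix):
    """Average CPI = sum(CPI_i * F_i), with F_i renormalized to sum to 1."""
    total_freq = sum(freq for freq, _ in mix.values())
    return relative_time(mix) / total_freq


def mips(mix, clock_hz):
    """MIPS = Clock Rate / (CPI * 10^6)."""
    return clock_hz / (cpi(mix) * 10**6)


if __name__ == "__main__":
    x = Fraction(20, 100)          # eliminate all loads
    new_mix = reg_mem_mix(x)
    print(cpi(BASE_MIX))           # prints 3/2
    print(relative_time(new_mix))  # prints 3/2 (same execution time)
    print(cpi(new_mix))            # prints 15/8
    print(float(mips(BASE_MIX, CLOCK_HZ)), float(mips(new_mix, CLOCK_HZ)))
```

At X = 0.2 both machines need the same number of cycles, so their execution times are equal, yet the Reg/Mem machine reports a lower MIPS rating because it executes fewer, slower instructions. Any smaller X (for example 0.1) makes the new machine strictly slower.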