Computer Performance and Cost Pradondet Nilagupta Fall 2005 (original notes from Randy Katz, & Prof. Jan M. Rabaey , UC Berkeley Prof. Shaaban, RIT) 204521 Digital System Architecture March 14, 2016 API ISA Link I/O Chan Review: What is Computer Architecture? Interfaces Technology IR Regs Machine Organization Applications Computer Architect Measurement & Evaluation 204521 Digital System Architecture March 14, 2016 2 Review: Computer System Components CPU Core 1 GHz - 3.8 GHz 4-way Superscaler RISC or RISC-core (x86): Deep Instruction Pipelines Dynamic scheduling Multiple FP, integer FUs Dynamic branch prediction Hardware speculation SDRAM PC100/PC133 100-133MHZ 64-128 bits wide 2-way inteleaved ~ 900 MBYTES/SEC )64bit) Current Standard Double Date Rate (DDR) SDRAM PC3200 200 MHZ DDR 64-128 bits wide 4-way interleaved ~3.2 GBYTES/SEC (one 64bit channel) ~6.4 GBYTES/SEC (two 64bit channels) L1 CPU All Non-blocking caches L1 16-128K 1-2 way set associative (on chip), separate or unified L2 256K- 2M 4-32 way set associative (on chip) unified L3 2-16M 8-32 way set associative (off or on chip) unified L2 Examples: Alpha, AMD K7: EV6, 200-400 MHz Intel PII, PIII: GTL+ 133 MHz Intel P4 800 MHz L3 Caches Front Side Bus (FSB) Off or On-chip adapters Memory Controller Memory Bus I/O Buses NICs Controllers Example: PCI, 33-66MHz 32-64 bits wide 133-528 MBYTES/SEC PCI-X 133MHz 64 bit 1024 MBYTES/SEC Memory RAMbus DRAM (RDRAM) 400MHZ DDR 16 bits wide (32 banks) ~ 1.6 GBYTES/SEC 204521 Digital System Architecture Disks Displays Keyboards Networks I/O Devices: North Bridge South Bridge I/O Subsystem Chipset March 14, 2016 3 The Architecture Process Estimate Cost & Performance Sort New concepts created Bad ideas 204521 Digital System Architecture March 14, 2016 Mediocre ideas Good ideas 4 Performance Measurement and Evaluation Many dimensions to computer performance – CPU execution time P • by instruction or sequence – floating point – integer – branch performance – Cache bandwidth – Main memory bandwidth – I/O performance C • bandwidth • seeks • pixels or polygons per second M Relative importance depends on applications 204521 Digital System Architecture March 14, 2016 5 Evaluation Tools Benchmarks, traces, & mixes – macrobenchmarks & suites • application execution time – microbenchmarks • measure one aspect of performance – traces • replay recorded accesses – cache, branch, register Simulation at many levels – ISA, cycle accurate, RTL, gate, circuit • trade fidelity for simulation rate Area and delay estimation Analysis – e.g., queuing theory 204521 Digital System Architecture March 14, 2016 MOVE BR LOAD STORE ALU 39% 20% 20% 10% 11% LD 5EA3 ST 31FF …. LD 1EA2 …. 6 Benchmarks Microbenchmarks Macrobenchmarks – measure one performance dimension – application execution time • cache bandwidth • main memory bandwidth • procedure call overhead • FP performance • measures overall performance, but on just one application 204521 Digital System Architecture March 14, 2016 Macro Micro – weighted combination of microbenchmark performance is a good predictor of application performance – gives insight into the cause of performance bottlenecks Applications Perf. Dimensions 7 Some Warnings about Benchmarks Benchmarks measure the whole system – application – compiler – operating system – architecture – implementation Popular benchmarks typically reflect yesterday’s programs – computers need to be designed for tomorrow’s programs 204521 Digital System Architecture Benchmark timings often very sensitive to – alignment in cache – location of data on disk – values of data Benchmarks can lead to inbreeding or positive feedback – if you make an operation fast (slow) it will be used more (less) often March 14, 2016 • so you make it faster (slower) – and it gets used even more (less) » and so on… 8 Choosing Programs To Evaluate Performance Levels of programs or benchmarks that could be used to evaluate performance: – Actual Target Workload: Full applications that run on the target machine. – Real Full Program-based Benchmarks: • Select a specific mix or suite of programs that are typical of targeted applications or workload (e.g SPEC95, SPEC CPU2000). – Small “Kernel” Benchmarks: • Key computationally-intensive pieces extracted from real programs. – Examples: Matrix factorization, FFT, tree search, etc. • Best used to test specific aspects of the machine. – Microbenchmarks: • Small, specially written programs to isolate a specific aspect of performance characteristics: Processing: integer, floating point, local memory, input/output, etc. 204521 Digital System Architecture March 14, 2016 9 Types of Benchmarks Cons Pros • Representative • Portable. • Widely used. • Measurements useful in reality. Actual Target Workload Full Application Benchmarks • Easy to run, early in the design cycle. • Identify peak performance and potential bottlenecks. 204521 Digital System Architecture Small “Kernel” Benchmarks Microbenchmarks March 14, 2016 • Very specific. • Non-portable. • Complex: Difficult to run, or measure. • Less representative than actual workload. • Easy to “fool” by designing hardware to run them well. • Peak performance results may be a long way from real application performance 10 SPEC: System Performance Evaluation Cooperative The most popular and industry-standard set of CPU benchmarks. SPECmarks, 1989: – 10 programs yielding a single number (“SPECmarks”). SPEC92, 1992: – SPECInt92 (6 integer programs) and SPECfp92 (14 floating point programs). SPEC95, 1995: – SPECint95 (8 integer programs): • go, m88ksim, gcc, compress, li, ijpeg, perl, vortex – SPECfp95 (10 floating-point intensive programs): • tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fppp, wave5 – Performance relative to a Sun SuperSpark I (50 MHz) which is given a score of SPECint95 = SPECfp95 = 1 SPEC CPU2000, 1999: – CINT2000 (11 integer programs). CFP2000 (14 floating-point intensive programs) – Performance relative to a Sun Ultra5_10 (300 MHz) which is given a score of SPECint2000 = SPECfp2000 = 100 204521 Digital System Architecture March 14, 2016 11 SPEC First Round 800 700 600 SPEC Perf One program: 99% of time in single line of code New front-end compiler could improve dramatically 500 400 300 200 100 tomcatv fpppp matrix300 eqntott li nasa7 doduc spice epresso gcc 0 Benchmark 204521 Digital System Architecture March 14, 2016 12 Impact of Means on SPECmark89 for IBM 550 (without and with special compiler option) Ratio to VAX: Program gcc espresso 35 spice doduc nasa7 li eqntott matrix300 fpppp tomcatv Mean Before 30 34 47 46 78 34 40 78 90 33 54 After 29 65 47 49 144 34 40 730 87 138 72 Geometric Ratio 1.33 204521 Digital System Architecture Time: Before 49 67 510 41 258 183 28 58 34 20 124 After 51 7.64 510 38 140 183 28 6 35 19 108 Arithmetic Ratio 1.16 March 14, 2016 Weighted Time: Before 8.91 7.86 5.69 5.81 3.43 7.86 6.68 3.43 2.97 2.01 54.42 After 9.22 5.69 5.45 1.86 7.86 6.68 0.37 3.07 1.94 49.99 Weighted Arith. Ratio 1.09 13 SPEC CPU2000 Programs CINT2000 (Integer) CFP2000 (Floating Point) Benchmark Language 164.gzip 175.vpr 176.gcc 181.mcf 186.crafty 197.parser 252.eon 253.perlbmk 254.gap 255.vortex 256.bzip2 300.twolf 168.wupwise 171.swim 172.mgrid 173.applu 177.mesa 178.galgel 179.art 183.equake 187.facerec 188.ammp 189.lucas 191.fma3d 200.sixtrack 301.apsi Descriptions C C C C C C C++ C C C C C Fortran 77 Fortran 77 Fortran 77 Fortran 77 C Fortran 90 C C Fortran 90 C Fortran 90 Fortran 90 Fortran 77 Fortran 77 Programs application domain: Engineering and scientific computation 204521 Digital System Architecture March 14, 2016 Compression FPGA Circuit Placement and Routing C Programming Language Compiler Combinatorial Optimization Game Playing: Chess Word Processing Computer Visualization PERL Programming Language Group Theory, Interpreter Object-oriented Database Compression Place and Route Simulator Physics / Quantum Chromodynamics Shallow Water Modeling Multi-grid Solver: 3D Potential Field Parabolic / Elliptic Partial Differential Equations 3-D Graphics Library Computational Fluid Dynamics Image Recognition / Neural Networks Seismic Wave Propagation Simulation Image Processing: Face Recognition Computational Chemistry Number Theory / Primality Testing Finite-element Crash Simulation High Energy Nuclear Physics Accelerator Design Meteorology: Pollutant Distribution Source: http://www.spec.org/osg/cpu2000/ 14 Top 20 SPEC CPU2000 Results (As of March 2002) Top 20 SPECint2000 Top 20 SPECfp2000 # MHz Processor int peak int base MHz Processor fp peak 1 1300 POWER4 814 790 1300 POWER4 1169 2 2200 Pentium 4 811 790 1000 Alpha 21264C 960 3 2200 Pentium 4 Xeon 810 788 1050 UltraSPARC-III Cu 827 4 1667 Athlon XP 724 697 2200 Pentium 4 Xeon 802 779 5 1000 Alpha 21264C 679 621 2200 Pentium 4 801 779 6 1400 Pentium III 664 648 833 Alpha 21264B 7 1050 UltraSPARC-III Cu 610 537 800 Itanium 701 701 8 1533 Athlon MP 609 587 833 Alpha 21264A 644 571 9 750 PA-RISC 8700 604 568 1667 Athlon XP 642 596 10 833 Alpha 21264B 571 497 750 PA-RISC 8700 581 526 11 1400 Athlon 554 495 1533 Athlon MP 547 504 12 833 Alpha 21264A 533 511 600 MIPS R14000 529 499 13 600 MIPS R14000 500 483 675 SPARC64 GP 509 371 14 675 SPARC64 GP 478 449 900 UltraSPARC-III 15 900 UltraSPARC-III 467 438 1400 Athlon 458 426 16 552 PA-RISC 8600 441 417 1400 Pentium III 456 437 17 750 POWER RS64-IV 439 409 500 PA-RISC 8600 440 397 18 700 Pentium III Xeon 438 431 450 POWER3-II 433 426 19 800 Itanium 365 358 500 Alpha 21264 422 383 20 400 MIPS R12000 353 328 400 MIPS R12000 407 382 784 482 fp base 1098 776 701 643 427 Source: http://www.aceshardware.com/SPECmine/top.jsp 204521 Digital System Architecture March 14, 2016 15 Common Benchmarking Mistakes (1/2) Only average behavior represented in test workload Skewness of device demands ignored Loading level controlled inappropriately Caching effects ignored Buffer sizes not appropriate Inaccuracies due to sampling ignored 204521 Digital System Architecture March 14, 2016 16 Common Benchmarking Mistakes (2/2) Ignoring monitoring overhead Not validating measurements Not ensuring same initial conditions Not measuring transient (cold start) performance Using device utilizations for performance comparisons Collecting too much data but doing too little analysis 204521 Digital System Architecture March 14, 2016 17 Architectural Performance Laws and Rules of Thumb 204521 Digital System Architecture March 14, 2016 Measurement and Evaluation Architecture is an iterative process: – Searching the space of possible designs – Make selections – Evaluate the selections made Good measurement tools are required to accurately evaluate the selection. 204521 Digital System Architecture March 14, 2016 19 Measurement Tools Benchmarks, Traces, Mixes Cost, delay, area, power estimation Simulation (many levels) – ISA, RTL, Gate, Circuit Queuing Theory Rules of Thumb Fundamental Laws 204521 Digital System Architecture March 14, 2016 20 Measuring and Reporting Performance What do we mean by one Computer is faster than another? – program runs less time Response time or execution time – time that users see the output Elapsed time – A latency to complete a task including disk accesses, memory accesses, I/O activities, operating system overhead Throughput – total amount of work done in a given time 204521 Digital System Architecture March 14, 2016 21 Performance “Increasing and decreasing” ????? We use the term “improve performance” or “ improve execution time” When we mean increase performance and decrease execution time . improve performance = increase performance improve execution time = decrease execution time 204521 Digital System Architecture March 14, 2016 22 Metrics of Performance Application Answers per month Operations per second Programming Language Compiler ISA (millions) of Instructions per second: MIPS (millions) of (FP) operations per second: MFLOP/s Datapath Control Function Units Transistors Wires Pins 204521 Digital System Architecture Megabytes per second Cycles per second (clock rate) March 14, 2016 Does Anybody Really Know What Time it is? UNIX Time Command: 90.7u 12.9s 2:39 65% User CPU Time (Time spent in program) System CPU Time (Time spent in OS) Elapsed Time (Response Time = 159 Sec.) (90.7+12.9)/159 * 100 = 65%, % of lapsed time that is CPU time. 45% of the time spent in I/O or running other programs 204521 Digital System Architecture March 14, 2016 Example UNIX time command 90.7u 12.95 2:39 65% user CPU time is 90.7 sec system CPU time is 12.9 sec elapsed time is 2 min 39 sec. (159 sec) % of elapsed time that is CPU time is 90.7 + 12.9 = 65% 159 204521 Digital System Architecture March 14, 2016 25 Time CPU time – time the CPU is computing – not including the time waiting for I/O or running other program User CPU time – CPU time spent in the program System CPU time – CPU time spent in the operating system performing task requested by the program decrease execution time CPU time = User CPU time + System CPU time 204521 Digital System Architecture March 14, 2016 26 Performance System Performance – elapsed time on unloaded system CPU performance – user CPU time on an unloaded system 204521 Digital System Architecture March 14, 2016 27 Two notions of “performance” Plane DC to Paris Speed Passengers Throughput Boeing 747 6.5 hours 610 mph 470 286,700 BAD/Sud Concorde 3 hours 1350 mph 132 178,200 Which has higher performance? ° Time to do the task (Execution Time) – execution time, response time, latency ° Tasks per day, hour, week, sec, ns. .. – throughput, bandwidth Response time and throughput often are in opposition 204521 Digital System Architecture March 14, 2016 28 Example • Time of Concorde vs. Boeing 747? • Concord is 1350 mph / 610 mph = 2.2 times faster = 6.5 hours / 3 hours • Throughput of Concorde vs. Boeing 747 ? • Concord is 178,200 pmph / 286,700 pmph = 0.62 “times faster” • Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster” • Boeing is 1.6 times (“60%”)faster in terms of throughput • Concord is 2.2 times (“120%”) faster in terms of flying time We will focus primarily on execution time for a single job 204521 Digital System Architecture March 14, 2016 29 Computer Performance Measures: Program Execution Time (1/2) For a specific program compiled to run on a specific machine (CPU) “A”, the following parameters are provided: – The total instruction count of the program. – The average number of cycles per instruction (average CPI). – Clock cycle of machine “A” 204521 Digital System Architecture March 14, 2016 30 Computer Performance Measures: Program Execution Time (2/2) How can one measure the performance of this machine running this program? – Intuitively the machine is said to be faster or has better performance running this program if the total execution time is shorter. – Thus the inverse of the total measured program execution time is a possible performance measure or metric: PerformanceA = 1 / Execution TimeA How to compare performance of different machines? What factors affect performance? How to improve performance?!!!! 204521 Digital System Architecture March 14, 2016 31 Comparing Computer Performance Using Execution Time To compare the performance of two machines (or CPUs) “A”, “B” running a given specific program: PerformanceA = 1 / Execution TimeA PerformanceB = 1 / Execution TimeB Machine A is n times faster than machine B means (or slower? if n < 1) : PerformanceA Speedup = n = PerformanceB 204521 Digital System Architecture = March 14, 2016 Execution TimeB Execution TimeA 32 Example For a given program: Execution time on machine A: ExecutionA = 1 second Execution time on machine B: ExecutionB = 10 seconds Performanc e Execution Time A B Speedup Performanc e Execution Time B A 10 10 1 The performance of machine A is 10 times the performance of machine B when running this program, or: Machine A is said to be 10 times faster than machine B when running this program. The two CPUs may target different ISAs provided the program is written in a high level language (HLL) 204521 Digital System Architecture March 14, 2016 33 CPU Execution Time: The CPU Equation A program is comprised of a number of instructions executed , I – Measured in: instructions/program The average instruction executed takes a number of cycles per instruction (CPI) to be completed. – Measured in: cycles/instruction, CPI CPU has a fixed clock cycle time C = 1/clock rate – Measured in: seconds/cycle CPU execution time is the product of the above three parameters as follows: CPU time = Seconds = Instructions x Cycles Program T = execution Time per program in seconds 204521 Digital System Architecture Program I x Seconds Instruction x CPI x Number of instructions executed Average CPI for program March 14, 2016 Cycle C CPU Clock Cycle 34 CPU Execution Time: Example A Program is running on a specific machine with the following parameters: – Total executed instruction count: 10,000,000 instructions Average CPI for the program: 2.5 cycles/instruction. – CPU clock rate: 200 MHz. (clock cycle = 5x10-9 seconds) What is the execution time for this program: CPU time = Seconds Program = Instructions x Cycles Program Instruction x Seconds Cycle CPU time = Instruction count x CPI x Clock cycle = 10,000,000 x 2.5 x 1 / clock rate = 10,000,000 x 2.5 x 5x10-9 = .125 seconds 204521 Digital System Architecture March 14, 2016 35 Aspects of CPU Execution Time CPU Time = Instruction count x CPI x Clock cycle Depends on: T = I x CPI x C Program Used Compiler ISA Instruction Count I (executed) Depends on: Program Used Compiler ISA CPU Organization 204521 Digital System Architecture CPI (Average CPI) March 14, 2016 Clock Cycle C Depends on: CPU Organization Technology (VLSI) 36 Factors Affecting CPU Performance CPU time = Seconds = Instructions x Cycles Program Program Instruction Instruction Count I CPI Program X X Compiler X X Instruction Set Architecture (ISA) X X X Organization (CPU Design) Technology Cycle Clock Cycle C X X (VLSI) 204521 Digital System Architecture x Seconds March 14, 2016 37 Performance Comparison: Example From the previous example: A Program is running on a specific machine with the following parameters: – Total executed instruction count, I: 10,000,000 instructions – Average CPI for the program: 2.5 cycles/instruction. – CPU clock rate: 200 MHz. Using the same program with these changes: – A new compiler used: New instruction count 9,500,000 New CPI: 3.0 – Faster CPU implementation: New clock rate = 300 MHZ What is the speedup with the changes? Speedup = Old Execution Time = Iold x New Execution Time Inew x CPIold x Clock cycleold CPInew x Clock Cyclenew Speedup = (10,000,000 x 2.5 x 5x10-9) / (9,500,000 x 3 x 3.33x10-9 ) = .125 / .095 = 1.32 or 32 % faster after changes. 204521 Digital System Architecture March 14, 2016 38 Instruction Types & CPI Given a program with n types or classes of instructions executed on a given CPU with the following characteristics: Ci = Count of instructions of typei CPIi = Cycles per instruction for typei i = 1, 2, …. n Then: CPI = CPU Clock Cycles / Instruction Count I Where: n CPU clock cycles i 1 CPI C i i Instruction Count I = S Ci 204521 Digital System Architecture March 14, 2016 39 Instruction Types & CPI: An Example An instruction set has three instruction classes: Instruction class A B C CPI 1 2 3 For a specific CPU design Two code sequences have the following instruction counts: Code Sequence 1 2 Instruction counts for instruction class A B C 2 1 2 4 1 1 CPU cycles for sequence 1 = 2 x 1 + 1 x 2 + 2 x 3 = 10 cycles CPI for sequence 1 = clock cycles / instruction count = 10 /5 = 2 CPU cycles for sequence 2 = 4 x 1 + 1 x 2 + 1 x 3 = 9 cycles CPI for sequence 2 = 9 / 6 = 1.5 204521 Digital System Architecture March 14, 2016 40 Instruction Frequency & CPI Given a program with n types or classes of instructions with the following characteristics: Ci = Count of instructions of typei CPIi = Average cycles per instruction of typei Fi = Frequency or fraction of instruction typei executed = Ci/ total executed instruction count = Ci/ I Then: CPI CPI n i 1 i F i Fraction of total execution time for instructions of type i = 204521 Digital System Architecture March 14, 2016 CPIi x Fi CPI 41 Instruction Type Frequency & CPI: A RISC Example CPIi x Fi Program Profile or Executed Instructions Mix Base Machine (Reg / Reg) Op Freq, Fi CPIi ALU 50% 1 Load 20% 5 Store 10% 3 Branch 20% 2 Typical Mix n CPI CPIi x Fi .5 1.0 .3 .4 % Time 23% = .5/2.2 45% = 1/2.2 14% = .3/2.2 18% = .4/2.2 Sum = 2.2 CPI CPI i F i i 1 204521 Digital System Architecture March 14, 2016 42 Performance Terminology “X is n% faster than Y” means: ExTime(Y) Performance(X) --------- = ExTime(X) -------------= 1 + ----Performance(Y) n 100 n = 100(Performance(X) - Performance(Y)) Performance(Y) n = 100(ExTime(Y) - ExTime(X)) ExTime(X) 204521 Digital System Architecture March 14, 2016 43 Example Example: Y takes 15 seconds to complete a task, X takes 10 seconds. What % faster is X? n = 100(ExTime(Y) - ExTime(X)) ExTime(X) n = 100(15 - 10) 10 n = 50% 204521 Digital System Architecture March 14, 2016 Speedup Speedup due to enhancement E: ExTime w/o E Speedup(E) = ------------ExTime w/ E = Performance w/ E ------------------Performance w/o E Suppose that enhancement E accelerates a fractionenhanced of the task by a factor Speedupenhanced , and the remainder of the task is unaffected, then what is ExTime(E) = ? Speedup(E) = ? 204521 Digital System Architecture March 14, 2016 45 Amdahl’s Law States that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time faster mode can be used Speedup = Performance for entire task using the enhancement Performance for the entire task without using the enhancement or Speedup = Execution time without the enhancement Execution time for entire task using the enhancement 204521 Digital System Architecture March 14, 2016 46 Amdahl’s Law ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced Speedupenhanced Speedupoverall = ExTimeold ExTimenew 1 = (1 - Fractionenhanced) + Fractionenhanced Speedupenhanced 204521 Digital System Architecture March 14, 2016 Example of Amdahl’s Law Floating point instructions improved to run 2X; but only 10% of actual instructions are FP ExTimenew = Speedupoverall = 204521 Digital System Architecture March 14, 2016 Example of Amdahl’s Law Floating point instructions improved to run 2X; but only 10% of actual instructions are FP ExTimenew = ExTimeold x (0.9 + .1/2) = 0.95 x ExTimeold Speedupoverall = 204521 Digital System Architecture 1 0.95 March 14, 2016 = 1.053 Performance Enhancement Calculations: Amdahl's Law The performance enhancement possible due to a given design improvement is limited by the amount that the improved feature is used Amdahl’s Law: Performance improvement or speedup due to enhancement E: Execution Time without E Speedup(E) = -------------------------------------Execution Time with E Performance with E = --------------------------------Performance without E – Suppose that enhancement E accelerates a fraction F of the execution time by a factor S and the remainder of the time is unaffected then: Execution Time with E = ((1-F) + F/S) X Execution Time without E Hence speedup is given by: Execution Time without E 1 Speedup(E) = --------------------------------------------------------- = -------------------((1 - F) + F/S) X Execution Time without E (1 - F) + F/S 204521 Digital System Architecture March 14, 2016 50 Pictorial Depiction of Amdahl’s Law Enhancement E accelerates fraction F of original execution time by a factor of S Before: Execution Time without enhancement E: (Before enhancement is applied) • shown normalized to 1 = (1-F) + F =1 Unaffected fraction: (1- F) Affected fraction: F Unchanged Unaffected fraction: (1- F) F/S After: Execution Time with enhancement E: Execution Time without enhancement E 1 Speedup(E) = ------------------------------------------------------ = -----------------Execution Time with enhancement E (1 - F) + F/S 204521 Digital System Architecture March 14, 2016 51 Performance Enhancement Example For the RISC machine with the following instruction mix given earlier: Op Freq Cycles CPI(i) % Time ALU 50% 1 .5 23% Load 20% 5 1.0 45% Store 10% 3 .3 14% Branch 20% 2 .4 18% If a CPU design enhancement improves the CPI of load instructions from 5 to 2, what is the resulting performance improvement from this enhancement: Fraction enhanced = F = 45% or .45 Unaffected fraction = 100% - 45% = 55% or .55 Factor of enhancement = 5/2 = 2.5 Using Amdahl’s Law: 1 1 Speedup(E) = ------------------ = --------------------- = (1 - F) + F/S .55 + .45/2.5 204521 Digital System Architecture March 14, 2016 1.37 52 An Alternative Solution Using CPU Equation Op Freq Cycles CPI(i) % Time ALU 50% 1 .5 23% Load 20% 5 1.0 45% Store 10% 3 .3 14% Branch 20% 2 .4 18% If a CPU design enhancement improves the CPI of load instructions from 5 to 2, what is the resulting performance improvement from this enhancement: Old CPI = 2.2 New CPI = .5 x 1 + .2 x 2 + .1 x 3 + .2 x 2 = 1.6 Original Execution Time Speedup(E) = ----------------------------------New Execution Time Instruction count x old CPI x clock cycle = ---------------------------------------------------------------Instruction count x new CPI x clock cycle old CPI = ------------ = new CPI 2.2 --------1.6 = 1.37 Which is the same speedup obtained from Amdahl’s Law in the first solution. 204521 Digital System Architecture March 14, 2016 53 Performance Enhancement Example (1/2) A program runs in 100 seconds on a machine with multiply operations responsible for 80 seconds of this time. By how much must the speed of multiplication be improved to make the program four times faster? Desired speedup = 4 = 100 ----------------------------------------------------Execution Time with enhancement Execution time with enhancement = 25 seconds 25 seconds = (100 - 80 seconds) + 80 seconds / n 25 seconds = 20 seconds + 80 seconds / n 5 = 80 seconds / n n = 80/5 = 16 Hence multiplication should be 16 times faster to get a speedup of 4. 204521 Digital System Architecture March 14, 2016 54 Performance Enhancement Example (2/2) For the previous example with a program running in 100 seconds on a machine with multiply operations responsible for 80 seconds of this time. By how much must the speed of multiplication be improved to make the program five times faster? Desired speedup = 5 = 100 ----------------------------------------------------Execution Time with enhancement Execution time with enhancement = 20 seconds 20 seconds = (100 - 80 seconds) + 80 seconds / n 20 seconds = 20 seconds + 80 seconds / n 0 = 80 seconds / n No amount of multiplication speed improvement can achieve this. 204521 Digital System Architecture March 14, 2016 55 Extending Amdahl's Law To Multiple Enhancements Suppose that enhancement Ei accelerates a fraction Fi of the execution time by a factor Si and the remainder of the time is unaffected then: Original Execution Time Speedup (1 i F i ) i F i XOriginal Execution Time ( Speedup S ) i 1 ((1 F ) F ) i i i i S i Note: All fractions Fi refer to original execution time before the enhancements are applied. . 204521 Digital System Architecture March 14, 2016 56 Amdahl's Law With Multiple Enhancements: Example Three CPU performance enhancements are proposed with the following speedups and percentage of the code execution time affected: Speedup1 = S1 = 10 Percentage1 = F1 = 20% Speedup2 = S2 = 15 Percentage1 = F2 = 15% Speedup3 = S3 = 30 Percentage1 = F3 = 10% While all three enhancements are in place in the new design, each enhancement affects a different portion of the code and only one enhancement can be used at a time. What is the resulting overall speedup? Speedup 1 ((1 F ) F ) i i i i S i Speedup = 1 / [(1 - .2 - .15 - .1) + .2/10 + .15/15 + .1/30)] = 1/ [ .55 + .0333 ] = 1 / .5833 = 1.71 204521 Digital System Architecture March 14, 2016 57 Pictorial Depiction of Example Before: Execution Time with no enhancements: 1 Unaffected, fraction: .55 S1 = 10 F1 = .2 / 10 S2 = 15 S3 = 30 F2 = .15 F3 = .1 / 15 / 30 Unchanged Unaffected, fraction: .55 After: Execution Time with enhancements: .55 + .02 + .01 + .00333 = .5833 Speedup = 1 / .5833 = 1.71 Note: All fractions (Fi , i = 1, 2, 3) refer to original execution time. 204521 Digital System Architecture March 14, 2016 58 Means Arithmetic mean 1 n Ti n i 1 Harmonic mean n n 1 i 1 R i Can be weighted. Represents total execution time 1 n Geometric mean 1 n n Ti Ti exp log i 1 Tri Tri n i 1 204521 Digital System Architecture March 14, 2016 Consistent independent of reference 61 How to Summarize Performance (1/2) Arithmetic mean (weighted arithmetic mean) tracks execution time: T or W *T i i n i Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: n 1 204521 Digital System Architecture or Ri n Wi March 14, 2016 Ri 62 How to Summarize Performance (2/2) Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10) – Arithmetic mean impacted by choice of reference machine Use the geometric mean for comparison: n Ti – Independent of chosen machine – but not good metric for total execution time 204521 Digital System Architecture March 14, 2016 63 Summarizing Performance Arithmetic mean – Average execution time – Gives more weight to longer-running programs Weighted arithmetic mean – More important programs can be emphasized – But what do we use as weights? – Different weight will make different machines look better • (see Figure 1.16 in your text book – Hennessey/Patterson 3rd Edition) • (see board for extension of previous example) 204521 Digital System Architecture March 14, 2016 64 Normalizing & the Geometric Mean Normalized execution time – Take a reference machine R – Time to run A on X / Time to run A on R Geometric mean n n Normalized execution time on i i 1 Neat property of the geometric mean: Consistent whatever the reference machine Do not use the arithmetic mean for normalized execution times 204521 Digital System Architecture March 14, 2016 65 Which Machine is “Better”? Computer A Program P1(sec) 1 Program P2 1000 Total Time 1001 204521 Digital System Architecture Computer B 10 100 110 March 14, 2016 Computer C 20 20 40 Weighted Arithmetic Mean Assume three weighting schemes P1/P2 .5/.5 .909/.091 .999/.001 Comp A 500.5 91.82 2 204521 Digital System Architecture Comp B 55.0 18.8 10.09 March 14, 2016 Comp C 20 20 20 Performance Evaluation Given sales is a function of performance relative to the competition, big investment in improving product as reported by performance summary Good products created then have: – Good benchmarks – Good ways to summarize performance If benchmarks/summary inadequate, then choose between improving product for real programs vs. improving product to get more sales; Sales almost always wins! Execution time is the measure of performance 204521 Digital System Architecture March 14, 2016 Computer Performance Measures : MIPS Rating (1/3) For a specific program running on a specific CPU the MIPS rating is a measure of how many millions of instructions are executed per second: MIPS Rating = Instruction count / (Execution Time x 106) = Instruction count / (CPU clocks x Cycle time x 106) = (Instruction count x Clock rate) / (Instruction count x CPI x 106) = Clock rate / (CPI x 106) Major problem with MIPS rating: As shown above the MIPS rating does not account for the count of instructions executed (I). – A higher MIPS rating in many cases may not mean higher performance or better execution time. i.e. due to compiler design variations. 204521 Digital System Architecture March 14, 2016 69 Computer Performance Measures : MIPS Rating (2/3) In addition the MIPS rating: – Does not account for the instruction set architecture (ISA) used. • Thus it cannot be used to compare computers/CPUs with different instruction sets. – Easy to abuse: Program used to get the MIPS rating is often omitted. • Often the Peak MIPS rating is provided for a given CPU which is obtained using a program comprised entirely of instructions with the lowest CPI for the given CPU design which does not represent real programs. 204521 Digital System Architecture March 14, 2016 70 Computer Performance Measures : MIPS Rating (3/3) Under what conditions can the MIPS rating be used to compare performance of different CPUs? The MIPS rating is only valid to compare the performance of different CPUs provided that the following conditions are satisfied: 1 The same program is used (actually this applies to all performance metrics) 2 The same ISA is used 3 The same compiler is used (Thus the resulting programs used to run on the CPUs and obtain the MIPS rating are identical at the machine code level including the same instruction count) 204521 Digital System Architecture March 14, 2016 71 Compiler Variations, MIPS, Performance: An Example (1/2) For the machine with instruction classes: Instruction class CPI A 1 B 2 C 3 For a given program two compilers produced the following instruction counts: Code from: Compiler 1 Compiler 2 Instruction counts (in millions) for each instruction class A B C 5 1 1 10 1 1 The machine is assumed to run at a clock rate of 100 MHz 204521 Digital System Architecture March 14, 2016 72 Compiler Variations, MIPS, Performance: An Example (2/2) MIPS = Clock rate / (CPI x 106) = 100 MHz / (CPI x 106) CPI = CPU execution cycles / Instructions count n CPU clock cycles i 1 CPI C i i CPU time = Instruction count x CPI / Clock rate For compiler 1: – CPI1 = (5 x 1 + 1 x 2 + 1 x 3) / (5 + 1 + 1) = 10 / 7 = 1.43 – MIP Rating1 = 100 / (1.428 x 106) = 70.0 – CPU time1 = ((5 + 1 + 1) x 106 x 1.43) / (100 x 106) = 0.10 seconds For compiler 2: – CPI2 = (10 x 1 + 1 x 2 + 1 x 3) / (10 + 1 + 1) = 15 / 12 = 1.25 – MIPS Rating2 = 100 / (1.25 x 106) = 80.0 – CPU time2 = ((10 + 1 + 1) x 106 x 1.25) / (100 x 106) = 0.15 seconds 204521 Digital System Architecture March 14, 2016 73 Computer Performance Measures :MFLOPS (1/2) A floating-point operation is an addition, subtraction, multiplication, or division operation applied to numbers represented by a single or a double precision floating-point representation. MFLOPS, for a specific program running on a specific computer, is a measure of millions of floating point-operation (megaflops) per second: MFLOPS = Number of floating-point operations / (Execution time x 106 ) MFLOPS rating is a better comparison measure between different machines (applies even if ISAs are different) than the MIPS rating. – Applicable even if ISAs are different 204521 Digital System Architecture March 14, 2016 74 Computer Performance Measures :MFLOPS (2/2) Program-dependent: Different programs have different percentages of floatingpoint operations present. i.e compilers have no floating- point operations and yield a MFLOPS rating of zero. Dependent on the type of floating-point operations present in the program. – Peak MFLOPS rating for a CPU: Obtained using a program comprised entirely of the simplest floating point instructions (with the lowest CPI) for the given CPU design which does not represent real floating point programs. 204521 Digital System Architecture March 14, 2016 75 Other ways to measure performance (1) Use MIPS (millions of instructions/second) MIPS = Instruction Count Exec. Time x = 106 Clock Rate CPI x 106 MIPS is a rate of operations/unit time. Performance can be specified as the inverse of execution time so faster machines have a higher MIPS rating So, bigger MIPS = faster machine. Right? 204521 Digital System Architecture March 14, 2016 76 Wrong!!! 3 significant problems with using MIPS: – Problem 1: • MIPS is instruction set dependent. • (And different computer brands usually have different instruction sets) – Problem 2: • MIPS varies between programs on the same computer – Problem 3: • MIPS can vary inversely to performance! Let’s look at an example of why MIPS doesn’t work… 204521 Digital System Architecture March 14, 2016 77 A MIPS Example (1) Consider the following computer: Instruction counts (in millions) for each instruction class Code from: A B C Compiler 1 5 1 1 Compiler 2 10 1 1 The machine runs at 100MHz. Instruction A requires 1 clock cycle, Instruction B requires 2 clock cycles, Instruction C requires 3 clock cycles. n ! Note important CPI = formula! S CPU Clock Cycles CPIi x Ci i =1 = Instruction Count 204521 Digital System Architecture March 14, 2016 Instruction Count 78 A MIPS Example (2) count CPI1 = [(5x1) + (1x2) + (1x3)] x 106 (5 + 1 + 1) x 106 MIPS1 = CPI2 = cycles 100 MHz 1.43 cycles = 69.9 [(10x1) + (1x2) + (1x3)] x 106 MIPS2 = (10 + 1 + 1) x 106 100 MHz 1.25 204521 Digital System Architecture = 10/7 = 1.43 = 80.0 = 15/12 = 1.25 So, compiler 2 has a higher MIPS rating and should be faster? March 14, 2016 79 A MIPS Example (3) Now let’s compare CPU time: ! Note important formula! CPU Time = CPU Time1 = CPU Time2 = Instruction Count x CPI Clock Rate 7 x 106 x 1.43 100 x 106 12 x 106 x 1.25 100 x 106 = 0.10 seconds = 0.15 seconds Therefore program 1 is faster despite a lower MIPS! 204521 Digital System Architecture March 14, 2016 80