Lecture 1: Performance
EEN 312: Processors: Hardware, Software, and Interfacing
Department of Electrical and Computer Engineering
Spring 2013, Dr. Rozier (UM)

PERFORMANCE TRENDS

[Figure: Growth in Processor Performance since 1978. Logarithmic Scale!]

Moore’s Law
• Gordon Moore
– One of the founders of Intel
– Famously predicted in 1965 that the transistor capacity of integrated circuits would double at a regular interval, commonly quoted as every 18-24 months.
– Not really a law, but has largely held true.
– Generally translates into increased performance and decreased cost.

Moore’s Law
[Figure: Exponential Growth]

How do we get to Performance?
• Do more transistors really mean more performance?
• Is it a one-to-one correlation?
• How might transistors NOT correlate to increased performance?

MEASURING PERFORMANCE

A simple example
• Say we have two computers. You know one is rated at 1GHz and another is rated at 800MHz.
• Which computer has a higher performance?

A simple example?
• What do GHz and MHz even mean?
• What else could differ about the machines?
• What else could differ about the context of performance?

THE SITUATION IS A COMPLEX ONE!

First, Some Measure Theory
• What is a measure? Formally?
– A way of assigning a number to each subset of some set, which can be read (intuitively) as the size of that subset.
– Measures require measurable spaces, and measurable sets.
– Not all sets are measurable!

Measurable Sets/Spaces
• One reason a space or set may be unmeasurable is if it is ill-defined.

Which Plane has a Higher Performance?
[Figure: four bar charts comparing the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 by Passenger Capacity, Cruising Range (miles), Cruising Speed (mph), and Passengers x mph]

Defining Performance
• We can define performance in several ways.
• Response time
– How long does it take to accomplish a task?
– We send input to a black box, and measure how long it takes to get output.
• Throughput
– How much work gets done during a certain amount of time?
– Watch a system, count the number of jobs finished during a certain amount of time.

Throughput Example
• What is the fastest way you can think of to deliver a large amount of data?
• Never underestimate the throughput of a Mack Truck loaded with hard drives!
• What’s the response time of our truck?

Response time as Execution Time
• Start a program, wait for it to return results.

Comparing Performance
• Given the performance or execution time of a computer (A) and a different computer (B) running the same program, we can compare performance.
• Relative performance: the ratio of the two execution times. If A is n times faster than B, then
$$\frac{\text{Performance}_A}{\text{Performance}_B} = \frac{\text{Execution Time}_B}{\text{Execution Time}_A} = n$$
• Why is relative performance important?

So How Do We Measure Performance?
• First let’s define performance:
– Execution time
• What is our measurable space?
• What is our measurable set?

Measuring Execution Time
• CPU execution time
• Wall clock time
• How might these differ?

Measuring Execution Time
• Clock cycles
• Instruction count

Clock Cycles
• Clock period – duration of a clock cycle
• Clock frequency – number of cycles per second
[Figure: clock timing diagram showing the clock period, data transfer and computation, and state update]

CPU Time
• We can improve performance by
– Reducing the number of clock cycles
– Increasing clock rate
– Often there is a trade-off
$$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$$

CPU Example
• Computer A: 2 GHz clock, 10s CPU time
• Computer B
– Aim for 6s CPU time. Increasing the clock speed increases the number of cycles by a factor of 1.2.

Break Into Groups
Find the necessary clock rate for Computer B.
$$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\,\text{s}}$$
$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^9$$
$$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^9}{6\,\text{s}} = \frac{24 \times 10^9}{6\,\text{s}} = 4\,\text{GHz}$$

Instruction Count and CPI
• Instruction count
– How many instructions the program executes
– Depends on the ISA and compiler
• CPI
– Cycles per instruction
– Determined by hardware
$$\text{Clock Cycles} = \text{Instruction Count} \times \text{CPI}$$
$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

CPI Example
• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Same ISA
• Which is faster? By how much?

Break Into Groups

$$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$$
A is faster…
$$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$$
$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$$
…by this much

CPI Detail
• Sometimes different instructions take differing amounts of time.
$$\text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right)$$
• Often we will want to weight by instruction proportion in a program.
$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$
The fraction $\text{Instruction Count}_i / \text{Instruction Count}$ is the relative frequency of class i.

CPI Example
• Have instruction classes A, B, and C. Two ways to compile our code:

Class               A    B    C
CPI for class       1    2    3
IC in sequence 1    2    1    2
IC in sequence 2    4    1    1

Give the average CPI for each program.

Sequence 1: IC = 5
Clock Cycles = 2×1 + 1×2 + 2×3 = 10
Avg. CPI = 10/5 = 2.0

Sequence 2: IC = 6
Clock Cycles = 4×1 + 1×2 + 1×3 = 9
Avg.
CPI = 9/6 = 1.5

Performance Summary
• Performance depends on
– Algorithm: affects IC, possibly CPI
– Programming language: affects IC, CPI
– Compiler: affects IC, CPI
– Instruction set architecture: affects IC, CPI, Tc
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

So Why Don’t We Have 1THz Computers?

The Power Wall
• In CMOS IC technology
$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$$
(Annotations on the slide: power ×30, voltage 5V → 1V, frequency ×1000)

The Power Wall
• Suppose a new CPU has
– 85% of capacitive load of old CPU
– 15% voltage and 15% frequency reduction
$$\frac{P_{new}}{P_{old}} = \frac{C_{old} \times 0.85 \times (V_{old} \times 0.85)^2 \times F_{old} \times 0.85}{C_{old} \times V_{old}^2 \times F_{old}} = 0.85^4 = 0.52$$
• The power wall
– We can’t reduce voltage further
– We can’t remove more heat
• How else can we improve performance?

Multiprocessors
• Multicore microprocessors
– More than one processor per chip
• Requires explicitly parallel programming
– Compare with instruction level parallelism
  • Hardware executes multiple instructions at once
  • Hidden from the programmer
– Hard to do
  • Programming for performance
  • Load balancing
  • Optimizing communication and synchronization

Amdahl’s Law
• Pitfall: improving an aspect of a computer and expecting a proportional improvement in overall performance
$$T_{improved} = \frac{T_{affected}}{\text{improvement factor}} + T_{unaffected}$$
• Example: multiply accounts for 80s/100s
– How much improvement in multiply performance to get 5× overall?

Break into Groups!

• A 5× overall speedup means a total time of 100s / 5 = 20s, so we would need
$$20 = \frac{80}{n} + 20$$
– Can’t be done!
• Corollary: make the common case fast

PROBLEM SETS

Consider the following processors, P1, P2, and P3, executing the same instruction set with clock rates and CPI as indicated.

Processor    Clock Rate    CPI
P1           3 GHz         1.5
P2           2.5 GHz       1.0
P3           4 GHz         2.2

1. Which processor has the highest performance in terms of instructions per second?
2. If the processors each execute a program in 10s, find the number of cycles and the number of instructions.
3. We are trying to reduce the execution time by 30% but this leads to an increase in CPI of 20%. What clock rate should we have to get this reduction?

Consider a computer running code with four main routines, A, B, C, and D.

Routine       Exec Time
Routine A     40s
Routine B     90s
Routine C     60s
Routine D     20s
Total Time    210s

1. How much is the total time reduced if the time for Routine A is reduced by 20%?
2. How much is the time for Routine B reduced if the total time is reduced by 20%?
3. Can the total time be reduced by 20% by only reducing the time for Routine D?

Consider a computer running code with four main routines, A, B, C, and D.

Routine       Exec Time    Instructions    Avg CPI
Routine A     40s          50x10^6         1
Routine B     90s          110x10^6        1
Routine C     60s          80x10^6         4
Routine D     20s          16x10^6         2
Total Time    210s         -               -

1. How much must we improve the CPI of Routine A if we want the program to run twice as fast?
2. How much must we improve the CPI of Routine C if we want the program to run twice as fast?
3. How much is the execution time improved if the CPI of routines A and B is reduced by 40%, and the CPI of routines C and D is reduced by 30%?

WRAP UP

For next time
• Read Chapter 2, Sections 2.1 – 2.3
• Finish Lab 0 by next lab session.
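
Recap sketch
As a recap of the worked examples in this lecture, the following Python sketch re-computes the clock-rate, CPI, power-wall, and Amdahl’s Law numbers from the slides. It is a minimal illustration, not course-provided code: the function names (`cpu_time`, `average_cpi`, `power_ratio`, `amdahl_time`) are assumptions introduced here, and the problem sets are deliberately left unsolved.

```python
# Minimal sketch of the lecture's formulas. All function names are
# illustrative assumptions, not part of the course material.


def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU Time = Instruction Count x CPI x Clock Cycle Time (seconds)."""
    return instruction_count * cpi * cycle_time_s


def average_cpi(cpi_per_class, count_per_class):
    """Weighted average CPI: sum(CPI_i x IC_i) / total instruction count."""
    total_cycles = sum(c * n for c, n in zip(cpi_per_class, count_per_class))
    return total_cycles / sum(count_per_class)


def power_ratio(cap_scale, volt_scale, freq_scale):
    """P_new / P_old for Power = C x V^2 x F, given per-term scale factors."""
    return cap_scale * volt_scale ** 2 * freq_scale


def amdahl_time(t_affected, t_unaffected, improvement_factor):
    """T_improved = T_affected / improvement factor + T_unaffected."""
    return t_affected / improvement_factor + t_unaffected


# CPU Example: A runs for 10 s at 2 GHz; B needs 1.2x the cycles in 6 s.
cycles_a = 10 * 2e9
clock_rate_b = 1.2 * cycles_a / 6
print(f"Clock rate for B: {clock_rate_b / 1e9:.2f} GHz")

# CPI Example: same ISA, so the instruction count I cancels in the ratio.
time_a = cpu_time(1, 2.0, 250e-12)
time_b = cpu_time(1, 1.2, 500e-12)
print(f"CPU Time B / CPU Time A: {time_b / time_a:.2f}")

# Instruction classes A, B, C with CPI 1, 2, 3 and two instruction mixes.
print("Avg CPI, sequence 1:", average_cpi([1, 2, 3], [2, 1, 2]))
print("Avg CPI, sequence 2:", average_cpi([1, 2, 3], [4, 1, 1]))

# Power wall: 85% capacitive load, 15% lower voltage and frequency.
print(f"P_new / P_old: {power_ratio(0.85, 0.85, 0.85):.2f}")

# Amdahl's Law: multiply is 80 s of 100 s. Even a huge speedup of the
# multiply leaves the total above the 20 s needed for 5x overall.
for factor in (2, 10, 1000):
    print(f"multiply {factor}x faster -> {amdahl_time(80, 20, factor):.2f} s")
```

Keeping each formula in its own small function mirrors the slides: the same `cpu_time` relation underlies both the clock-rate and the CPI comparisons, and the loop at the end shows the total time approaching, but never reaching, the unaffected 20s.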