Performance

Lecture notes from MKP, H. H. Lee and S. Yalamanchili

Reading
• Section 1.6
• Practice Problems: Module 5 – 20, 21, 27

Understanding Performance
• Algorithm
  o Determines number of operations executed
• Programming language, compiler, architecture
  o Determine number of machine instructions executed per operation
[Diagram label: Instruction Set Architecture]
• Processor and memory system
  o Determine how fast instructions are executed
• I/O system (including OS)
  o Determines how fast I/O operations are executed

Defining Performance
• Which airplane has the best performance?
[Figure: four bar charts comparing the Boeing 777, Boeing 747, BAC/Sud Concorde, and Douglas DC-8-50 by Passenger Capacity, Cruising Range (miles), Cruising Speed (mph), and Passengers x mph]

Metrics
• Response time (latency)
  o How long it takes to do a task
• Throughput
  o Total work done per unit time, e.g., tasks/transactions/… per hour
• Trading throughput vs. latency
• Energy/Power
  o Measure of work being performed
  o Increases with clock frequency/voltage
  o Determines temperature
  o Is affected by temperature

Relative Performance
• "X is n times faster than Y"

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$

• Example: time taken to run a program
  o 10s on A, 15s on B
  o $\text{Execution Time}_B / \text{Execution Time}_A = 15\text{s} / 10\text{s} = 1.5$
  o So A is 1.5 times faster than B

CPU Clocking
• Operation of digital hardware governed by a constant-rate clock
[Figure: clock timing diagram marking the clock period, with data transfer and computation during each cycle followed by a state update]
• Clock period: duration of a clock cycle
  o e.g., 250ps = 0.25ns = $250 \times 10^{-12}$ s
• Clock frequency (rate): cycles per second
  o e.g., 4.0GHz = 4000MHz = $4.0 \times 10^{9}$ Hz

CPU Time

$$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$$

• Performance improved by
  o Reducing number of clock cycles
  o Increasing clock rate
  o Hardware designer must often trade off clock rate against cycle count

CPU Time Example
• Computer A: 2GHz clock, 10s CPU time
• Designing Computer B
  o Aim for 6s CPU time
  o Can do faster clock, but causes 1.2 × clock cycles
• How fast must Computer B clock be?
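The setup of this question can be sketched in a few lines of Python using CPU Time = Clock Cycles / Clock Rate. The function name `required_clock_rate` and its parameter names are made up for illustration and do not come from the lecture; only the numbers (2GHz, 10s, 1.2×, 6s) are taken from the example.

```python
def required_clock_rate(clock_rate_a_hz, cpu_time_a_s, cycle_factor, target_time_s):
    """Clock rate (Hz) Computer B needs to finish in target_time_s.

    Hypothetical helper: B is assumed to need cycle_factor times
    as many clock cycles as A for the same program.
    """
    cycles_a = cpu_time_a_s * clock_rate_a_hz   # Clock Cycles_A = CPU Time_A x Clock Rate_A
    cycles_b = cycle_factor * cycles_a          # B needs 1.2x the cycles of A
    return cycles_b / target_time_s             # Clock Rate_B = Clock Cycles_B / CPU Time_B


# Numbers from the example: A runs at 2 GHz for 10 s; B should take 6 s.
rate_b_hz = required_clock_rate(2e9, 10.0, 1.2, 6.0)
print(f"Computer B clock rate: {rate_b_hz / 1e9:.1f} GHz")
```

Changing `cycle_factor` or `target_time_s` shows how quickly the required clock rate grows when the cycle count rises or the time budget shrinks.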
$$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\text{s}}$$

$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\text{s} \times 2\text{GHz} = 20 \times 10^{9}$$

$$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^{9}}{6\text{s}} = \frac{24 \times 10^{9}}{6\text{s}} = 4\text{GHz}$$

Instruction Count and CPI

$$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$$

• Instruction Count for a program
  o Determined by program, ISA and compiler
• Average cycles per instruction
  o Determined by CPU hardware
  o If different instructions have different CPI, average CPI is affected by instruction mix

Cycles and Instructions
• Multiplication takes more time than addition
• Floating point operations take longer than integer ones
• Accessing memory takes (in general) more time than accessing registers
• Important point: changing the cycle time often changes the number of cycles required for various instructions (more later)

Program Execution time

$$\text{ExecutionTime} = \left[ \sum_{i=1}^{n} C_i \times \text{CPI}_i \right] \times \text{cycle\_time} \approx \text{Instruction\_count} \times \text{CPI}_{avg} \times \text{clock\_cycle\_time}$$

• n is the number of instruction classes; $C_i$ is the instruction count of class i and $\text{CPI}_i$ its cycles per instruction
• Instruction_count: determined by algorithms/compiler
• $\text{CPI}_{avg}$: determined by architecture
• clock_cycle_time: determined by technology

$$\text{CPI}_{avg} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$

• The ratio $\text{Instruction Count}_i / \text{Instruction Count}$ is the relative frequency of class i

CPI Example
• Computer A: Cycle Time = 250ps, CPI = 2.0
• Computer B: Cycle Time = 500ps, CPI = 1.2
• Same ISA
• Which is faster, and by how much?

$$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\text{ps} = I \times 500\text{ps}$$

$$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\text{ps} = I \times 600\text{ps}$$

$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\text{ps}}{I \times 500\text{ps}} = 1.2$$

• A is faster, by a factor of 1.2

CPI Example
• Alternative compiled code sequences using instructions in classes A, B, C

Class              A   B   C
CPI for class      1   2   3
IC in sequence 1   2   1   2
IC in sequence 2   4   1   1

• Sequence 1: IC = 5
  o Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  o Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
  o Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  o Avg. CPI = 9/6 = 1.5

Performance Summary

The BIG Picture

$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

• Performance depends on
  o Algorithm: affects IC, possibly CPI
  o Programming language: affects IC, CPI
  o Compiler: affects IC, CPI
  o Instruction set architecture: affects IC, CPI, Tc (clock cycle time)

SPEC CPU Benchmark
• Programs used to measure performance
  o Supposedly typical of actual workload
• Standard Performance Evaluation Corp (SPEC)
  o Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
  o Elapsed time to execute a selection of programs (negligible I/O, so focuses on CPU performance)
  o Normalize relative to reference machine
  o Summarize as geometric mean of performance ratios: CINT2006 (integer) and CFP2006 (floating-point)

$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$

Other Benchmark Suites
• Report performance metrics for execution on target platforms
  o Designed to assess how well the platforms function in specific domains
• Examples
  o MediaBench: multimedia
  o EEMBC: embedded systems
  o Rodinia, Parboil: GPU systems
  o SPECweb, SPECjbb: enterprise systems
  o Many more…

Pitfall: Amdahl's Law
• Improving an aspect of a computer and expecting a proportional improvement in overall performance

$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$

• Example: multiply accounts for 80s/100s
  o How much improvement in multiply performance to get 5× overall?
  o A 5× overall speedup requires $T_{\text{improved}} = 100\text{s} / 5 = 20\text{s}$, so

$$20 = \frac{80}{n} + 20$$

  o Can't be done! The unaffected 20s alone already uses the whole budget, so no finite n works.
• Corollary: make the common case fast

Amdahl's Law
• With f the fraction of the old execution time that is affected and P the speedup of the faster mode:

$$\text{Speed-up} = \frac{\text{Exec\_time}_{\text{old}}}{\text{Exec\_time}_{\text{new}}} = \frac{1}{(1 - f) + \frac{f}{P}}$$

• Performance improvement from using the faster mode is limited by the fraction of time in which the faster mode can be applied.
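To get a feel for that limit, here is a minimal Python sketch of the speed-up expression 1 / ((1 - f) + f/P), applied to the multiply example (f = 0.8). The function name `amdahl_speedup` and the chosen values of P are illustrative assumptions, not part of the lecture.

```python
def amdahl_speedup(f, p):
    """Overall speed-up when a fraction f of the old time is made p times faster."""
    return 1.0 / ((1.0 - f) + f / p)


# Multiply example: 80 s out of 100 s is affected, so f = 0.8.
f = 0.8
for p in (2, 10, 100, 1e6):  # illustrative improvement factors
    print(f"P = {p:g}: overall speed-up = {amdahl_speedup(f, p):.3f}")

# As P grows without bound, the speed-up approaches 1 / (1 - f).
upper_bound = 1.0 / (1.0 - f)
print(f"Upper bound for f = {f}: {upper_bound:.3f}")
```

Even with an enormous P, the result stays below the bound, which is why the 5× target in the multiply example cannot be reached.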
[Figure: bar for Told split into segments f and (1 - f); bar for Tnew split into segments (1 - f) and f/P]

Concluding Remarks
• Cost/performance is improving
  o Due to underlying technology development
• Hierarchical layers of abstraction
  o In both hardware and software
• Instruction set architecture
  o The hardware/software interface
• Execution time: the best performance measure
• Power is a limiting factor
  o Use parallelism to improve performance

Study Guide
• Practice problems provided on the class website

Glossary
• Amdahl's Law
• Benchmarks
• CPI (cycles per instruction)
• CPU Time
• Latency
• SPEC CPU
• Throughput
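As a small sketch tying the glossary entries CPI and CPU Time back to numbers, the Python below recomputes the two CPI examples from these notes: the class A/B/C code sequences and the Computer A versus Computer B comparison. The function names and the instruction count of one million are assumptions chosen for illustration; the CPI values, instruction mixes, and cycle times are the ones given in the examples.

```python
def clock_cycles(counts, cpi):
    """Total clock cycles: sum over classes of (instruction count x CPI)."""
    return sum(counts[cls] * cpi[cls] for cls in counts)


def average_cpi(counts, cpi):
    """Average CPI = total clock cycles / total instruction count."""
    return clock_cycles(counts, cpi) / sum(counts.values())


def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s


# Instruction classes and the two compiled sequences from the CPI example.
cpi_per_class = {"A": 1, "B": 2, "C": 3}
sequence_1 = {"A": 2, "B": 1, "C": 2}
sequence_2 = {"A": 4, "B": 1, "C": 1}

avg_cpi_1 = average_cpi(sequence_1, cpi_per_class)
avg_cpi_2 = average_cpi(sequence_2, cpi_per_class)
print(f"Sequence 1: {clock_cycles(sequence_1, cpi_per_class)} cycles, avg CPI {avg_cpi_1}")
print(f"Sequence 2: {clock_cycles(sequence_2, cpi_per_class)} cycles, avg CPI {avg_cpi_2}")

# Computer A vs. Computer B, same ISA; the instruction count is arbitrary
# because it cancels in the ratio.
instructions = 1_000_000
time_a = cpu_time(instructions, 2.0, 250e-12)
time_b = cpu_time(instructions, 1.2, 500e-12)
print(f"CPU Time B / CPU Time A = {time_b / time_a:.2f}")
```

Sequence 2 executes more instructions yet needs fewer cycles, which illustrates why instruction count alone is not a reliable performance measure.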