4.1 Performance and Cost/performance

Table 4.1 Key characteristics of six passenger aircraft: all figures are approximate; some relate to a specific model/configuration of the aircraft or are averages of the cited range of values.

Figure 4.1 Performance improvement as a function of cost.

4.2 Defining Computer Performance

Performance = 1 / Execution time

Throughput: the amount of work performed per unit time. It can be measured as the number of processes completed per unit time.
Turnaround time: the average time from the moment a job is submitted until the moment it is completed. It measures how long the average user has to wait for output.
Response time: in an interactive system, the time from when a user presses Enter or clicks a mouse until the system delivers a final response.

To filter out variable factors (e.g., scheduling, interrupts, I/O delay):
Performance = 1 / CPU execution time

Figure 4.2 Pipeline analogy shows that imbalance between processing power and I/O capabilities leads to a performance bottleneck.

CPU execution time = Instructions × (Cycles per instruction) × (Seconds per cycle)
                   = Instructions × CPI / (Clock rate)
(CPI: cycles per instruction)

Performance comparison:
(Performance of M1) / (Performance of M2) = (Execution time of M2) / (Execution time of M1)

4.3 Performance Enhancement and Amdahl’s Law

Amdahl’s law:
s = 1 / (f + (1 – f) / p) ≤ min(p, 1/f)
f: fraction of the running time that is unaffected (e.g., time for instructions that cannot be parallelized)
p: speedup factor for the remaining 1 – f fraction (from a parallel computer or a redesigned CPU or algorithm)
s: overall speedup

Study Example 4.1

Figure 4.4 Amdahl’s law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast.

4.4 Performance Measurement vs Modeling

Figure 4.5 Running times of six programs on three machines.

Benchmarks: real or synthetic programs that are selected for comparative evaluation of machine performance.
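As a quick numerical check of Amdahl's law from Section 4.3, the formula s = 1 / (f + (1 – f) / p) and its bound min(p, 1/f) can be sketched as follows. This is a minimal illustration only: the function names and the sample values of f and p are assumptions chosen for the demonstration and are not taken from Example 4.1.

```python
def amdahl_speedup(f: float, p: float) -> float:
    """Overall speedup s = 1 / (f + (1 - f) / p).

    f: fraction of the running time that is unaffected (0 <= f <= 1).
    p: speedup factor applied to the remaining 1 - f fraction (p > 0).
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if p <= 0.0:
        raise ValueError("p must be positive")
    return 1.0 / (f + (1.0 - f) / p)


def speedup_bound(f: float, p: float) -> float:
    """Upper bound min(p, 1/f); when f == 0 the bound is simply p."""
    return p if f == 0.0 else min(p, 1.0 / f)


if __name__ == "__main__":
    # Illustrative values only (hypothetical, not from the slides).
    for f in (0.0, 0.1, 0.5):
        for p in (2.0, 10.0, 100.0):
            s = amdahl_speedup(f, p)
            b = speedup_bound(f, p)
            print(f"f={f:.1f}  p={p:6.1f}  s={s:6.2f}  bound={b:6.2f}")
```

Running the loop shows the behavior plotted in Figure 4.4: as p grows, s flattens out toward 1/f, so the unaffected fraction dominates the achievable speedup.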
SPEC: Standard Performance Evaluation Corporation

Table 4.2 Summary of SPEC CPU2000 benchmark suite characteristics.

Figure 4.6 Example graphical depiction of SPEC benchmark results.

Study Example 4.3

Performance Estimation

System’s peak performance: expressed in instructions per second (MIPS, MFLOPS).

Average CPI = Σ over all instruction classes i of (class-i fraction) × (class-i CPI)
CPU execution time = Instructions × Average CPI / (Clock rate)

Table 4.3 Usage frequency, in percentage, for various instruction classes in four representative applications.

Example 4.4 CPI and IPS calculation

Solution
a. For M1, assume all instructions are class I instructions:
   Peak performance of M1 = 1 / (Avg. CPI × Clock time) = 600 / 2.0 = 300 MIPS
   Note: Avg. CPI × Clock time is in seconds per instruction, so its reciprocal is in instructions per second; with the clock rate in MHz (600 for M1), the result is in MIPS.
   For M2, assume all instructions are class N instructions:
   Peak performance of M2 = 1 / (Avg. CPI × Clock time) = 500 / 2.0 = 250 MIPS
b. Average CPI for M1 = 5.0 × 0.25 + 2.0 × 0.25 + 2.4 × 0.5 = 2.95
   Average CPI for M2 = 4.0 × 0.25 + 3.8 × 0.25 + 2.0 × 0.5 = 2.95
c. 1. Average CPI = 2.5 × 0.25 + 2.0 × 0.25 + 2.4 × 0.5 = 2.325
      MIPS for option 1 = 600 / 2.325 = 258
   2. Average CPI = 5.0 × 0.25 + 1.2 × 0.25 + 2.4 × 0.5 = 2.75
      MIPS for option 2 = 600 / 2.75 = 218
   3. MIPS for option 3 = 750 / 2.95 = 254
   Conclusion: option 1 has the greatest impact.
d. With a larger cache, the cache miss rate is reduced by 2% (from 5% to 3%); that is, every CPI is reduced by 10 × 2% = 0.2 cycles (a cache miss imposes a 10-cycle penalty).
   Average CPI for M1 = (5.0 – 0.2) × 0.25 + (2.0 – 0.2) × 0.25 + (2.4 – 0.2) × 0.5 = 2.75
   This option is comparable to option 2 in c.
e. Average CPI for M1 = 5.0 × x + 2.0 × y + 2.4 × (1 – x – y) = 2.6x – 0.4y + 2.4
   Average CPI for M2 = 4.0 × x + 3.8 × y + 2.0 × (1 – x – y) = 2x + 1.8y + 2
   We need 600 / (2.6x – 0.4y + 2.4) > 500 / (2x + 1.8y + 2), which gives 2.56y > 0.2x.
   That is, when x/y < 12.8, M1 runs faster than M2 for the given task.

Example 4.5 MIPS rating can be misleading

a. Runtime for the output of compiler 1 = (600M × 1 + 400M × 2) / 10^9 = 1.4 s
   Runtime for the output of compiler 2 = (400M × 1 + 400M × 2) / 10^9 = 1.2 s
   The output of compiler 2 runs faster.
b.
Code produced by compiler 2 is 1.4 / 1.2 = 1.17 times as fast as that of compiler 1.
c. Average CPI for compiler 1 = (600M × 1 + 400M × 2) / 1000M = 1.4
   Average CPI for compiler 2 = (400M × 1 + 400M × 2) / 800M = 1.5
   MIPS rating of compiler 1 = 1000 / 1.4 = 714
   MIPS rating of compiler 2 = 1000 / 1.5 = 667
   By MIPS rating, compiler 1 appears faster, even though its output takes longer to run.

4.5 Reporting Computer Performance

Table 4.4 Measured or estimated execution times for three programs.

Wrong method (arithmetic mean):
Speedup of Y over X = (0.1 + 10.0 + 10.0) / 3 = 6.7   (1)
Speedup of X over Y = (10.0 + 0.1 + 0.1) / 3 = 3.4   (contradicts (1))

Total time comparison: correct if the programs are run the same number of times.

Geometric mean:
Speedup of Y over X = (0.1 × 10.0 × 10.0)^(1/3) = 2.15   (2)
Speedup of X over Y = (10.0 × 0.1 × 0.1)^(1/3) = 0.46   (consistent with (2))

Example 4.6 (based on Table 4.3)

Answer:
a. CPI for data compression application on M1 = 0.25 × 4.0 + 0.32 × 1.5 + 0.16 × 1.2 + 0 × 6.0 + 0.19 × 2.5 + 0.08 × 2.0 = 2.31
   CPI for data compression application on M2 = 2.54
   CPI for nuclear reactor simulation application on M1 = 3.94
   CPI for nuclear reactor simulation application on M2 = 2.89
b. Because the programs and clock rates are the same, the speedup is given by the ratio of CPIs.
   Data compression speedup (M2 over M1) = 2.31 / 2.54 = 0.91
   Nuclear reactor simulation speedup (M2 over M1) = 3.94 / 2.89 = 1.36
c. Overall performance advantage of M2 over M1 is (0.91 × 1.36)^(1/2) = 1.11

4.6 The Quest for High Performance

Figure 4.7 Exponential growth of supercomputer performance [Bell92].

Figure 4.8 Milestones in the Accelerated Strategic Computing Initiative (ASCI) program, sponsored by the U.S. Department of Energy, with extrapolation up to the PFLOPS level.

Problem 4.5

Problem 4.12
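The comparison of averaging methods in Section 4.5 can be reproduced with a short sketch. It uses the per-program speedups of Y over X quoted there (0.1, 10.0, 10.0) and derives the X-over-Y speedups as their reciprocals. The helper names are assumptions made for this illustration. The point it shows is that the two geometric means are reciprocals of each other, while the two arithmetic means both exceed 1 and so contradict each other.

```python
import math


def arithmetic_mean(values):
    """Plain average; shown only to reproduce the 'wrong method'."""
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, computed via logs; values must be > 0."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Per-program speedups of Y over X, as quoted in Section 4.5.
y_over_x = [0.1, 10.0, 10.0]
# Speedups of X over Y are the reciprocals of the values above.
x_over_y = [1.0 / s for s in y_over_x]

if __name__ == "__main__":
    print("arithmetic:", round(arithmetic_mean(y_over_x), 2),
          round(arithmetic_mean(x_over_y), 2))
    print("geometric: ", round(geometric_mean(y_over_x), 2),
          round(geometric_mean(x_over_y), 2))
    # The product of the two geometric means is 1 up to rounding,
    # which is the consistency property the section relies on.
    print("product:   ",
          round(geometric_mean(y_over_x) * geometric_mean(x_over_y), 6))
```

The same geometric_mean helper applied to the two speedups in Example 4.6, part b (0.91 and 1.36), gives the overall figure of about 1.11 reported in part c.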