CS 1104 Help Session V
Performance Analysis
Colin Tan, ctank@comp.nus.edu.sg
S15-04-15

Benchmarking
• Your boss wants you to make a purchasing decision between two systems.
  – How will you decide which system is better for your company?
• Benchmarks are programs that help you to decide.
  – They measure various aspects of a computer's performance:
    • CPU speed
    • I/O speed
    • Etc.

Pitfalls of Benchmarks
• Benchmarks are sensitive to various hardware and software configurations:
  – For portability, benchmarks typically come in source-code form.
    • Execution time (and hence system "performance") can be affected by how clever the compiler is (e.g. can the compiler do loop-index transformations, inline expansion, loop unrolling, common sub-expression elimination, intelligent spills?).
    • Execution time can also be affected by compiler "switches".
      – Switches control how the compiler produces the object code.
      – Optimizing for size produces small but slow code => seemingly poor performance.
      – Optimizing for speed produces big but fast code => seemingly good performance.

Pitfalls of Benchmarks – Cache Effects
• Some benchmarks make use of tight loops.
  – These give extremely good performance on machines with large cache block sizes, because the entire loop may be contained in a single cache block, resulting in 100% instruction cache hits.
• Likewise, benchmarks may give good results because data accesses are confined to a small portion of memory.
• Both effects can give unrealistic results, since real programs may not have tight loops at all, and may access data from many parts of memory.

Pitfalls of Benchmarks – I/O Effects
• I/O-intensive benchmarks may perform well on machines with a slow CPU but fast I/O.
• CPU-bound benchmarks may perform well on machines with a fast CPU but slow I/O.

Pitfalls of Benchmarks
• Moral:
  – Benchmarks chosen must reflect the behavior of the actual programs to be used.
    • E.g. cache access patterns, I/O access patterns.
– Standardize on a compiler.
  • Don't use the slow, stupid and clunky Microsoft compiler on one machine and the fast, intelligent GNU compiler on another.
– Standardize on compiler switches.
  • Don't optimize for size on one machine (slow) and optimize for speed on another (fast).

Optional Reading
Why Are Fast Programs Fat?
• When a compiler is set to optimize for speed, it often produces large code:
  – Loops are unrolled.
    • The aim of loop unrolling is to generate more instructions per iteration so that there is more opportunity for instruction scheduling and for superscalar execution.
    • This can lead to large code – see the example on the next slide.

Why Are Fast Programs Fat?
Unrolling Example (Optional)
• Original Code:
    for(i = 0; i < 100; i++) {
        a[i] = b[i] + c;
    }
• Unrolled Code (by a factor of 4 – note that the iteration count of 100 is divisible by 4; an unrolling factor that does not divide the iteration count would need a clean-up loop to avoid running past the end of the arrays):
    for(i = 0; i < 100; i = i + 4) {
        a[i]   = b[i]   + c;
        a[i+1] = b[i+1] + c;
        a[i+2] = b[i+2] + c;
        a[i+3] = b[i+3] + c;
    }

Optional Reading
Why Are Fast Programs Fat?
• Short functions are in-lined.
  – Function calls are replaced by the actual function code, removing the need to actually do a function call.
  – This saves time, as function calls are expensive:
    • Need to save the context (register values, etc.) of the caller to the stack.
    • Need to save the Program Counter value of the caller.
    • Need to jump to the called function's code.
    • Need to restore the caller's context before jumping back to the caller's code.

Why Are Fast Programs Fat?
Function In-Lining Example (Optional Reading)
• Original Code:
    double f(int x) {
        int y;
        y = x + 3;
        return cos(y) + sin(y);
    }
    main() {
        double x1 = f(1);
        double x2 = f(2);
        double x3 = f(3);
    }

Why Are Fast Programs Fat?
Function In-Lining Example (Optional Reading)
• In-lined Code:
    main() {
        int y1 = 1 + 3;
        double x1 = cos(y1) + sin(y1);
        int y2 = 2 + 3;
        double x2 = cos(y2) + sin(y2);
        int y3 = 3 + 3;
        double x3 = cos(y3) + sin(y3);
    }

Amdahl's Law
• Suppose you have a program that runs in Y seconds, of which X seconds are caused by an instruction class C.
• Suppose you want to improve Y by a certain percentage, by improving the running time of the instructions in class C.
• How much improvement must be made to class C's execution time?

Amdahl's Law
• TNEW = TC/speedup_c + TNC
  – Here TC is the time taken by the class C instructions, speedup_c is the improvement that must be made to the class C instructions, and TNC is the time taken by all the rest of the instructions.
  – TNEW is the new execution time.

Amdahl's Law Example
• Suppose a program runs in 100 seconds, of which 80 seconds are taken up by multiply instructions. How much improvement must be made to multiply instruction execution time in order to improve overall execution time by 4 times?
• What about improving overall execution time by 5 times?

Solution: 4-times Speedup
• TC = 80 seconds, TNC = 100 - 80 = 20 seconds, TNEW = 100/4 = 25 seconds, speedup_c to be determined.
• TNEW = TC/speedup_c + TNC
• 25 = 80/speedup_c + 20
• speedup_c = 16
• Therefore multiply instructions must be sped up by 16x to get a 4x improvement in overall execution time.

Solution: 5-times Speedup
• TC = 80 seconds, TNC = 100 - 80 = 20 seconds, TNEW = 100/5 = 20 seconds
• Amdahl's Law:
    20 = 80/speedup_c + 20
    0 = 80/speedup_c
  – This equation has no finite solution for speedup_c.
• If solving for speedup_c gives no finite answer, or a negative one, then the new timing TNEW is not achievable. In this case it is not possible to improve execution time by 5x (from 100s to 20s): even with infinitely fast multiply instructions, the program still takes TNC = 20 seconds.

Optimizing Strategy
Making the Common Case Fast
• Suppose your program runs in 100 seconds, and rotate operations take 5 seconds while add instructions take 75 seconds.
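The Amdahl's Law solutions above can be checked with a short script. This is a minimal sketch, not part of the original slides; the function name `required_class_speedup` and the argument names are mine, but the formula is exactly TNEW = TC/speedup_c + TNC rearranged for speedup_c. The same function can be applied to the rotate/add scenario just introduced.

```python
# Amdahl's Law: solve for the class speedup needed to hit a target
# overall execution time. Names follow the slides: t_c is the time
# spent in class C, t_nc the time in everything else, t_new the target.

def required_class_speedup(t_c, t_nc, t_new):
    """Return speedup_c such that t_new = t_c/speedup_c + t_nc,
    or None if the target is unachievable (t_new <= t_nc)."""
    if t_new <= t_nc:
        # Even infinitely fast class-C instructions still leave t_nc seconds.
        return None
    return t_c / (t_new - t_nc)

# 4x overall speedup: 100s -> 25s, with 80s of multiplies
print(required_class_speedup(80, 20, 25))   # 16.0
# 5x overall speedup: 100s -> 20s is impossible
print(required_class_speedup(80, 20, 20))   # None
```

The guard clause makes the 5x case explicit: once the target time drops to TNC or below, no finite class speedup can reach it.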
Find the improvement in execution time if:
  – Rotates are improved by 4x
  – Adds are improved by 4x

Optimizing Strategy
Making the Common Case Fast
• Improving Rotate Timings
  – TC = 5s, speedup_c = 4x, TNC = 100 - 5 = 95s
  – TNEW = 5/4 + 95 = 96.25 seconds (Improvement: 3.75 seconds)
• Improving Add Timings
  – TC = 75s, speedup_c = 4x, TNC = 100 - 75 = 25s
  – TNEW = 75/4 + 25 = 43.75 seconds (Improvement: 56.25 seconds)

Optimizing Strategy
Making the Common Case Fast
• The add instruction was used more frequently than the rotate instruction, and accounted for far more of the execution time (75% vs. 5%).
• By speeding up the add instruction, we speed up the bulk of the execution time.
• This results in a better performance improvement than if we sped up the rotate instructions.

Cycles Per Instruction
• Each instruction takes a certain amount of time to execute.
  – Time must be spent fetching the instruction, understanding what it means, fetching data, performing the operation on the data, and storing the results.
• When this time is measured in clock cycles, the unit of measure is Cycles Per Instruction, or CPI.

Cycles Per Instruction
• CPI may be divided into two categories:
  – Class CPIs
    • These are the CPI measurements for various classes of instructions.
      – E.g. for multiply instructions, add instructions, load/store instructions, etc.
    • Class CPIs are affected by processor organization.
      – Affected by things like how fast we can fetch data from the registers, how fast the ALU is, etc.
    • Different hardware implementations of the same processor may have different class CPIs.
      – E.g. a pipelined implementation of a processor typically has lower class CPIs than a non-pipelined implementation.

Cycles Per Instruction
  – Overall CPI
    • This is affected by the class CPIs, as well as by the compiler.
• For a processor with 4 classes of instructions (A, B, C, D), the overall CPI is given by:

    CPIoverall = (fA/N)*CPIA + (fB/N)*CPIB + (fC/N)*CPIC + (fD/N)*CPID

  where fX = the number of instructions in the program taken from class X, CPIX = the CPI of class X, and N = the total number of instructions in the program.
• Usually the ratio fX/N is given for class X, as its frequency.
  – E.g. a table may say that class X has a frequency of 0.05. This means that if the program has 100 instructions, 5 of them come from class X.
• Overall CPI varies between programs.
  – Each program has different proportions (fX/N) for each class of instructions.

Cycles Per Instruction
Example 1
• Given the following data about a program P, find the overall CPI of the program:
  – Class A, CPI 3, Frequency 0.1
  – Class B, CPI 4, Frequency 0.5
  – Class C, CPI 5, Frequency 0.2
  – Class D, CPI 6, Frequency 0.2
• Overall CPI = 0.1 * 3 + 0.5 * 4 + 0.2 * 5 + 0.2 * 6 = 4.5

Cycles Per Instruction
Improving Performance
• An improvement in either the overall CPI or the class CPIs will result in better performance, all else being equal.
• Overall CPI
  – Overall CPI can be improved by using a better compiler, or by using a better hardware design.
    • A better compiler improves overall CPI by producing a higher frequency of fast instructions.
    • A better hardware design improves overall CPI by improving the individual class CPIs.

Cycles Per Instruction
Example 2
• The program from the previous example was modified such that 40% of the total set of instructions come from class A, 40% from class B, 10% from class C and 10% from class D. Find the new overall CPI.

Cycles Per Instruction
Example 2
• Overall CPI = 0.4 * 3 + 0.4 * 4 + 0.1 * 5 + 0.1 * 6 = 3.9
• What is the lower bound on the overall CPI?
  – The lower bound occurs when all instructions come from the fastest class. So the lower bound for our example is:
    1.0 * 3 + 0.0 * 4 + 0.0 * 5 + 0.0 * 6 = 3.0

Cycles Per Instruction
Improving Performance
• Improving Class CPIs
  – Class CPIs are affected by hardware implementation.
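The overall-CPI arithmetic of Examples 1 and 2, and the lower-bound check, can be reproduced with a few lines of Python. This is a sketch of the frequency-weighted sum from the formula above; the function name `overall_cpi` is my own.

```python
# Overall CPI as a frequency-weighted sum of class CPIs.
# freqs are the ratios fX/N from the slides and should sum to 1.0 here.

def overall_cpi(class_cpis, freqs):
    return sum(c * f for c, f in zip(class_cpis, freqs))

cpis = [3, 4, 5, 6]                              # classes A, B, C, D
print(overall_cpi(cpis, [0.1, 0.5, 0.2, 0.2]))   # Example 1: 4.5
print(overall_cpi(cpis, [0.4, 0.4, 0.1, 0.1]))   # Example 2: 3.9
print(overall_cpi(cpis, [1.0, 0.0, 0.0, 0.0]))   # lower bound: all from class A, 3.0
```

Putting all the weight on the fastest class reproduces the lower-bound argument: the weighted sum collapses to that class's CPI.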
– Hence the only way to improve class CPIs is to improve the hardware design.
– Class CPIs CANNOT BE IMPROVED BY SOFTWARE MEANS!
  • Changing the compiler, etc., will not improve class CPIs.

Cycles Per Instruction
Example 3
• In our Example 1, suppose we improved the CPI of class B instructions to 1.0. What will be the new overall CPI?
    Overall CPI = 0.1 * 3 + 0.5 * 1 + 0.2 * 5 + 0.2 * 6 = 3.0
• What is the new lower bound on the overall CPI?
  – Again, the lower bound is achieved if all instructions come from the fastest class (here, class B):
    Overall CPI = 0.0 * 3 + 1.0 * 1 + 0.0 * 5 + 0.0 * 6 = 1.0

Cycles Per Instruction
Lower Bound for Class CPIs
• What is the lower bound for class CPIs?
  – The answer, surprisingly, is 0!
  – Reason:
    • The hardware design (i.e. the organization) for certain types of instructions can be optimized such that execution of that instruction takes much less than 1 cycle.
    • This effectively gives that instruction a lower-bound CPI of 0!

Cycles Per Instruction
Example 4
• Take the Intel mov ax, bx instruction, for example.
  – This instruction copies the contents of register bx into register ax.
  – Traditional data paths connect all the registers to the ALU inputs, and the ALU output to all the registers.
  – Electronic switches control which register will load its data onto the ALU, and which register the ALU result will be written to.

Cycles Per Instruction
Example 4
• Traditional implementation of the mov instruction:
  • Read register BX – 1 cycle
  • Pass through the ALU, without doing anything – 1 cycle
  • Write to register AX – 1 cycle
  • Total = 3 cycles (i.e. the CPI for mov instructions is 3)
• The read and write operations both take 1 cycle, as they are synchronous operations that take place only at the rising edge of a clock cycle.
• The ALU pass-through also takes 1 cycle, to simplify the timing design of the CPU, even though the ALU does not do anything.

Cycles Per Instruction
Example 4
• New Implementation:
  – 1.
Connect the registers directly to each other, as well as to the ALU inputs (i.e. AX connects directly to BX).
  – 2. If a mov ax, bx instruction is encountered, close the direct connection between registers AX and BX.
    • This is fast, as it consists of just closing a transistor switch – perhaps 1/10th of a cycle.
  – 3. Perform an asynchronous copy of BX into AX.
    • Independent of clock cycles => no need to wait for the next rising edge.
    • This is again fast, typically about 2 transistor times – around 2/10ths of a cycle.
  – Total timing = 3/10ths of a cycle => effectively a 0-cycle lower bound!

Pitfalls of Performance Prediction
• Trying to predict whether machine A is better than machine B based on single measurements is dangerous.
  – Single measurements: this means that you consider only CPI, or only clock rate, in determining which is faster.
  – This can lead to very unpredictable results.

Pitfalls of Performance Prediction – CPI
• CPI gives a good idea of the performance of a program on a given processor.
  – Lower CPI means better performance, on a given processor.
    • The term "given processor" means that factors like clock speed remain constant!
• However, it is a fallacy to assume that low CPI = fast execution time!

Pitfalls of Performance Prediction – Clock Frequency
• Chip manufacturers are fond of quoting clock frequencies as an indication of speed.
  – E.g. Intel 500MHz Pentium II.
• Unfortunately, increasing the clock speed can sometimes slow down a system rather than speed it up!
  – All design requires compromise.
  – Increasing the clock speed may compromise on CPI, leading to larger overall CPIs. This may actually give worse performance.

Pitfalls of Performance Prediction – Execution Time
• The execution time of a program is the only semi-reliable way of telling which machine is faster.
  – It is only semi-reliable, as it is still subject to the pitfalls of benchmarks mentioned earlier.
    • A benchmark may run very fast on machine A and slowly on machine B, yet the actual applications that you use may be just the reverse!
• This problem is caused by choosing inappropriate benchmarks, e.g. choosing processor-intensive benchmarks even though your actual applications are I/O intensive.

Execution Time
• To compute execution time:
  – 1. Compute the total number of clock cycles taken by the program:
    a) Usually we have to start off from an average (i.e. "overall") CPI value, CPI.
    b) Find out how many instructions, IC, are executed.
    c) The total number of cycles taken is just IC x CPI.
  – 2. Multiply by the clock period TCLK to get the execution time:
    • TCLK = 1/clock_rate, if clock_rate is specified in Hz.
• Combining (1) and (2):
    TEXEC = (CPI * IC) / clock_rate

Execution Time
Example 5
• For Examples 1 to 3 above, given that each program runs on a 500MHz machine, find the execution times.
  – TExec1 = 4.5IC/(500 x 10^6) seconds
  – TExec2 = 3.9IC/(500 x 10^6) seconds
  – TExec3 = 3.0IC/(500 x 10^6) seconds
• Find the relative speedups between TExec1, TExec2 and TExec3.

Execution Time
Example 5
• Note that it is not possible to compare TExec2 against TExec1 or TExec3.
  – This is because a different compiler was used, and the instruction count IC for TExec2 may not be the same as the instruction count IC for TExec1 and TExec3!
  – Please read the question carefully before proceeding!
• Speedup of Example 3 over Example 1:
    Speedup = TExec1/TExec3 = [4.5IC/(500 x 10^6)] / [3.0IC/(500 x 10^6)] = 1.5x speedup

Execution Time
Example 6
• For Examples 1 and 3 above, suppose that the program in Example 1 runs on a computer with a 600 MHz clock, while the program in Example 3 runs on a computer with a 300 MHz clock. Which program runs faster, and by how much?

Execution Time
Example 6
• TExec1 = 4.5IC/(600 x 10^6)
• TExec3 = 3.0IC/(300 x 10^6)
• The ratio of the two timings is:
    TExec1/TExec3 = [4.5IC/(600 x 10^6)] / [3.0IC/(300 x 10^6)] = 0.75
• TExec1 is actually smaller than TExec3!
• This is despite the fact that the CPI for Example 1 was bigger!
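The comparison above can be checked numerically. Below is a minimal sketch of the slides' formula TEXEC = (CPI * IC) / clock_rate; the function name `t_exec` is mine. When the two runs have the same instruction count, IC cancels out of the ratio, so a symbolic IC of 1 suffices for speedup calculations.

```python
# Execution time from the formula T_exec = (CPI * IC) / clock_rate.

def t_exec(cpi, ic, clock_hz):
    """Execution time in seconds for a program of ic instructions."""
    return cpi * ic / clock_hz

# Example 6: Example 1's program (CPI 4.5) on a 600 MHz machine vs
# Example 3's program (CPI 3.0) on a 300 MHz machine, same IC.
t1 = t_exec(4.5, 1, 600e6)
t3 = t_exec(3.0, 1, 300e6)
print(t1 / t3)  # ratio below 1: the higher-CPI program finishes first
```

The ratio comes out at 0.75, confirming that the lower-CPI program loses once the clock rates differ.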
• This is an example of why CPI is not a reliable indication of which computer is faster.

Execution Time
Example 8
• Given the following information:

    Machine A: 400 MHz, overall CPI of program = 4.5
    Machine B: 300 MHz, overall CPI of program = 2.0

  find out which machine is faster, and by how much. Assume that both programs are identical.

Execution Time
Example 8
• If both programs are identical, then the instruction count IC is the same.
• Texec_A = 4.5IC/(400 x 10^6)
• Texec_B = 2.0IC/(300 x 10^6)
• Speedup = Texec_A/Texec_B = 1.6875
• Thus Texec_A > Texec_B, even though the clock rate for machine A is higher!
• This is an example of how clock rate is not a reliable estimate of relative performance.

Instruction Frequencies
A Pitfall
• When computing overall CPIs, we must ALWAYS check that the instruction frequencies (if they are given as fractions like 0.1, etc.) add up to 1.0!
• If they don't, the final CPI must be normalized.

Instruction Frequencies
Example 9
• Given the following information, compute the overall CPI:
  – Class A, CPI 1.0, Frequency 0.2
  – Class B, CPI 2.0, Frequency 0.3
  – Class C, CPI 3.0, Frequency 0.1
  – Class D, CPI 4.0, Frequency 0.2
• Overall CPI = 0.2 * 1.0 + 0.3 * 2.0 + 0.1 * 3.0 + 0.2 * 4.0 = 1.9
• This answer is WRONG!

Instruction Frequencies
Example 9
• The answer is wrong because the frequencies do not add up to 1.0!
  – 0.2 + 0.3 + 0.1 + 0.2 = 0.8, NOT 1.0!
• To get the correct CPI, divide by the total frequency:
  – Correct CPI = Wrong CPI / Total Frequency = 1.9 / 0.8 = 2.375

Summary
• Benchmarks help us to choose which system is better for our needs.
• Unfortunately, benchmarks are subject to many pitfalls, and we must choose benchmarks that closely resemble our actual application load.
• Amdahl's Law implies that to get the best improvement in performance, we should make the common case fast.
  – The examples covered show that making uncommon cases fast produces little speedup.

Summary
• CPI is a measure of how many cycles each instruction takes to execute.
• Two types of CPI:
  – Class CPIs
    • These are the CPIs of instructions in individual classes.
    • Affected by hardware design.
      – Changing the hardware design often changes the class CPIs.
      – E.g. we may shift from a serial adder to a parallel adder, resulting in fewer cycles being required to add 2 numbers.

Summary
• Two types of CPI:
  – Overall (or average) CPI
    • Affected by the compiler.
    • Also affected by hardware organization changes.
      – These will cause individual class CPIs to change, causing a change in the overall CPI.
• If the designer changes only the clock frequency (BUT NOT ANYTHING ELSE!), then neither type of CPI will be affected.

Summary
• Neither clock rate nor CPI is a reliable measure of which computer is faster.
• The only semi-reliable measure is execution time.
  – However, execution time is subject to the pitfalls of benchmarking.
  – If you choose benchmarks that are not representative of your expected workload, or if the actual workload ends up being different from your expected workload, then the execution times of the benchmarks no longer form the basis for reliable comparisons.
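As a final worked check, the normalization rule from Example 9 can be sketched in a few lines. This is my own illustration, not part of the slides; the function name `overall_cpi_normalized` is hypothetical, but the arithmetic is exactly "wrong CPI divided by total frequency".

```python
# Normalizing an overall CPI when the given frequencies do not sum to 1.0,
# as in Example 9. Dividing by the total frequency rescales the weights
# so that they behave like proper fractions fX/N.

def overall_cpi_normalized(class_cpis, freqs):
    total = sum(freqs)          # 0.8 in Example 9, not 1.0
    weighted = sum(c * f for c, f in zip(class_cpis, freqs))  # the "wrong" 1.9
    return weighted / total

cpi = overall_cpi_normalized([1.0, 2.0, 3.0, 4.0], [0.2, 0.3, 0.1, 0.2])
print(cpi)  # the corrected CPI from Example 9, 1.9 / 0.8
```

Note that when the frequencies already sum to 1.0, the division changes nothing, so this version can safely be used in place of the plain weighted sum.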