Week 2, Lecture 1
Chapter 2: Performance Issues
Dr. Qurban Ali, EE Department

Diminishing Returns
- Internal organization of processors is complex
  - Can extract a great deal of parallelism
  - Further significant increases are likely to be relatively modest
- Benefits from cache are reaching a limit
- Increasing the clock rate runs into the power dissipation problem
  - Some fundamental physical limits are being reached

Microprocessor Speed-up
Techniques built into contemporary processors include:
- Pipelining: the processor moves data or instructions into a conceptual pipe, with all stages of the pipe processing simultaneously.
- Branch prediction: the processor looks ahead in the instruction code fetched from memory and predicts which branches, or groups of instructions, are likely to be processed next.
- Data flow analysis: the processor analyzes which instructions are dependent on each other's results, or data, to create an optimized schedule of instructions.
- Speculative execution: using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program, holding the results in temporary locations and keeping the execution engines as busy as possible.

Performance Improvement Example: Increased Cache Capacity
- Typically two or three levels of cache between processor and main memory
- Chip density has increased
  - More cache memory on the chip
  - Faster data access
- The Pentium chip devoted about 10% of its chip area to cache
- The Pentium 4 devotes about 50%

[Figure: Intel Microprocessor Performance]

Summary - Improvements in Chip
- Increase hardware speed of the processor
  - Fundamentally due to shrinking logic gate size
    - More gates, packed more tightly, increasing the clock rate
    - Propagation time for signals reduced
- Increase size and speed of caches
  - Dedicating part of the processor chip
    - Cache access times drop significantly
- Change processor organization and architecture
  - Increase effective speed of execution
  - Core technology, parallelism

Performance Assessment - Clock Speed?
- Key parameters
  - Performance, cost, size, security, reliability, power consumption
- System clock speed, in Hz or multiples of Hz
  - Clock rate, clock cycle, clock tick, cycle time
  - Signals in the CPU take time to settle down to 1 or 0
- Operations need to be synchronised
  - Instruction execution proceeds in discrete steps and usually requires multiple clock cycles per instruction
  - So clock speed is not the whole story

Performance Assessment - Instruction Execution Rate
- Millions of instructions per second (MIPS), expressed as:

    MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6)

  where Ic is the instruction count, T is the execution time, f is the clock frequency, and CPI is the average number of cycles per instruction, defined as:

    CPI = ( Σ_i CPI_i x I_i ) / Ic

  where CPI_i is the number of cycles required by instructions of type i and I_i is the number of executed instructions of type i.
- Millions of floating-point operations per second (MFLOPS)
  - Heavily dependent on instruction set, compiler design, processor implementation, and cache and memory hierarchy

    MFLOPS rate = (number of executed floating-point operations in a program) / (execution time x 10^6)
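To make the MIPS and CPI definitions above concrete, here is a minimal Python sketch (illustrative only; the clock rate, instruction classes, and counts are assumed example values, not taken from the lecture):

    # Illustrative sketch of the CPI and MIPS definitions above.
    # The instruction classes and counts are hypothetical example values.
    CLOCK_HZ = 100e6  # assumed 100 MHz clock

    cycles_per_class = {"A": 1, "B": 2, "C": 3}                          # CPI_i per class
    executed_counts  = {"A": 5_000_000, "B": 1_000_000, "C": 1_000_000}  # I_i per class

    Ic = sum(executed_counts.values())                  # total instruction count
    total_cycles = sum(cycles_per_class[c] * executed_counts[c]
                       for c in executed_counts)

    CPI = total_cycles / Ic        # CPI = (Σ CPI_i x I_i) / Ic
    T = total_cycles / CLOCK_HZ    # execution time in seconds
    MIPS = Ic / (T * 1e6)          # equivalently CLOCK_HZ / (CPI * 1e6)

    print(f"Ic = {Ic:,}  CPI = {CPI:.2f}  T = {T:.3f} s  MIPS = {MIPS:.1f}")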
CPI Example
- A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles, respectively.
- The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.
- The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.
- What is the CPI for each sequence? Which sequence will be faster, and by how much?

MIPS Example
- Two different compilers are being tested on a 100 MHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles, respectively. Both compilers are used to produce code for a large piece of software.
- The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
- The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.
- Which compiler's code will be faster according to CPI, and which according to MIPS? (See the worked sketch after Table 2.9.)

Another Example
- Our favorite program runs in 10 seconds on computer A, which has a 400 MHz clock. We are trying to help a computer designer build a new machine B that will run this program in 6 seconds. The designer can use new (or perhaps more expensive) technology to substantially increase the clock rate, but has informed us that this increase will affect the rest of the CPU design, causing machine B to require 1.2 times as many clock cycles as machine A for the same program. What clock rate should we tell the designer to target? (Also worked in the sketch after Table 2.9.)
- "Think" questions:
  - Does doubling the clock rate double the performance?
  - Can a machine with a slower clock have better performance?

Performance Factors and System Attributes

  System attribute              | Ic | p | m | k | τ
  ------------------------------+----+---+---+---+---
  Instruction set architecture  | X  | X |   |   |
  Compiler technology           | X  | X | X |   |
  Processor implementation      |    | X |   |   | X
  Cache and memory hierarchy    |    |   |   | X | X

Table 2.9 is a matrix in which one dimension shows the five performance factors and the other dimension shows the four system attributes. An X in a cell indicates a system attribute that affects a performance factor.
- Ic: instruction count
- p: number of processor cycles needed to decode and execute an instruction
- m: number of memory references per instruction
- k: ratio between memory cycle time and processor cycle time
- τ: processor cycle time, equal to 1/f
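The MIPS Example and the clock-rate example above can be checked with a short Python sketch that applies the CPI and MIPS formulas from the previous section; the numbers come straight from the two examples, while the variable names and print formatting are only illustrative scaffolding:

    # Worked MIPS Example: two compilers on the 100 MHz machine.
    # Class A = 1 cycle, Class B = 2 cycles, Class C = 3 cycles.
    CLOCK_HZ = 100e6
    CYCLES = {"A": 1, "B": 2, "C": 3}

    compilers = {
        "compiler 1": {"A": 5e6, "B": 1e6, "C": 1e6},
        "compiler 2": {"A": 10e6, "B": 1e6, "C": 1e6},
    }

    for name, counts in compilers.items():
        Ic = sum(counts.values())
        cycles = sum(CYCLES[c] * n for c, n in counts.items())
        cpi = cycles / Ic
        t = cycles / CLOCK_HZ              # execution time in seconds
        mips = Ic / (t * 1e6)
        print(f"{name}: CPI = {cpi:.2f}, time = {t:.2f} s, MIPS = {mips:.0f}")

    # Expected outcome: compiler 1 gives CPI ~1.43, 0.10 s, 70 MIPS;
    # compiler 2 gives CPI 1.25, 0.15 s, 80 MIPS. The higher-MIPS code is
    # actually slower, which is why MIPS alone can be misleading.

    # "Another Example": machine A runs the program in 10 s at 400 MHz;
    # machine B needs 1.2x as many cycles but must finish in 6 s.
    cycles_A = 10 * 400e6                  # 4e9 cycles on machine A
    target_hz = 1.2 * cycles_A / 6         # clock rate machine B must reach
    print(f"Target clock rate for B: {target_hz / 1e6:.0f} MHz")   # 800 MHz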
Desirable Benchmark Characteristics
- Written in a high-level language, making it portable across different machines
- Representative of a particular kind of programming style, such as system programming, numerical programming, or commercial programming
- Can be measured easily
- Has wide distribution

System Performance Evaluation Corporation (SPEC)
- Benchmark suite
  - A collection of programs, defined in a high-level language
  - Attempts to provide a representative test of a computer in a particular application or system programming area
- SPEC
  - An industry consortium
  - Defines and maintains the best-known collection of benchmark suites
  - Its performance measurements are widely used for comparison and research purposes

SPEC CPU2006
- Best-known SPEC benchmark suite
- Industry-standard suite for processor-intensive applications
- Appropriate for measuring performance of applications that spend most of their time doing computation rather than I/O
- Consists of 17 floating-point programs written in C, C++, and Fortran and 12 integer programs written in C and C++
- The suite contains over 3 million lines of code
- Fifth generation of the processor-intensive suite from SPEC
- Defines Speed and Rate metrics
  - Single task and throughput, respectively

SPEC Speed Metric
- Single task
- A base runtime is defined for each benchmark using a reference machine
- Results are reported as the ratio of reference time to system run time:

    r_i = Tref_i / Tsut_i

  - Tref_i: execution time of benchmark i on the reference machine
  - Tsut_i: execution time of benchmark i on the system under test
- Overall performance is calculated by averaging the ratios for all 12 integer benchmarks
  - Use the geometric mean
  - Appropriate for normalized numbers such as ratios

SPEC Rate Metric
- Measures the throughput or rate of a machine carrying out a number of tasks
- Multiple copies of each benchmark are run simultaneously
  - Typically, the number of copies is the same as the number of processors
- The ratio is calculated as follows:

    r_i = N x Tref_i / Tsut_i

  - Tref_i: reference execution time for benchmark i
  - N: number of copies run simultaneously
  - Tsut_i: elapsed time from the start of execution of the program on all N processors until the completion of all copies of the program
- Again, a geometric mean is calculated

Amdahl's Law
- Gene Amdahl
- Potential speedup of a program using multiple processors
- Concluded that:
  - Code needs to be parallelizable
  - Speedup is bounded, giving diminishing returns for more processors
- Task dependent
  - Servers gain by maintaining multiple connections on multiple processors
  - Databases can be split into parallel tasks

Amdahl's Law Formula
- For a program running on a single processor:
  - A fraction f of the code is infinitely parallelizable with no scheduling overhead; the remaining (1 - f) of the code is inherently serial
  - T is the total execution time of the program on a single processor
  - N is the number of processors that exploit the parallel code

    Speedup = (time to execute program on a single processor) / (time to execute program on N parallel processors)
            = ( T(1 - f) + T f ) / ( T(1 - f) + T f / N )
            = 1 / ( (1 - f) + f / N )

- Conclusions
  - When f is small, parallel processors have little effect
  - As N -> ∞, speedup is bounded by 1/(1 - f)
    - Diminishing returns for using more processors

Amdahl's Law - An Example
- Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?
- How about making it five times faster?

    Execution time after improvement = Execution time unaffected + (Execution time affected / Amount of improvement)
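Here is a minimal Python sketch of Amdahl's Law and the multiplication example above (the function name and the loop values are my own, chosen for illustration):

    # Amdahl's Law: speedup = 1 / ((1 - f) + f / n), where f is the fraction
    # of execution time that benefits and n is the factor by which that
    # fraction is sped up.
    def speedup(f, n):
        """Overall speedup when a fraction f of the time is sped up by n."""
        return 1.0 / ((1.0 - f) + f / n)

    # Multiplication example: 100 s total, 80 s of it in multiply, so f = 0.8.
    # To run 4x faster we need 25 s total = 20 s + 80 s / n  ->  n = 16.
    f = 80 / 100
    for n in (2, 4, 8, 16):
        print(f"multiply {n:2d}x faster -> program {speedup(f, n):.2f}x faster")

    # Running 5x faster would require 20 s = 20 s + 80 s / n, i.e. the 80 s
    # of multiply time would have to vanish entirely -- impossible, matching
    # the bound 1 / (1 - f) = 5 as n -> infinity.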
Little's Law
- A fundamental and simple relation with broad applications is Little's Law
  - It can be applied to almost any system that is statistically in steady state and in which there is no leakage
- It applies to queuing systems
  - If a server is idle, an arriving item is served immediately; otherwise the arriving item joins a queue
  - There can be a single queue for a single server or for multiple servers, or multiple queues, one for each of multiple servers
- The average number of items in a queuing system equals the average rate at which items arrive multiplied by the average time that an item spends in the system (L = λW)
  - The relationship requires very few assumptions
  - Because of its simplicity and generality, it is extremely useful
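To close, a brief Python sketch applying Little's Law (the arrival rates, times, and occupancy figures are hypothetical values chosen purely to illustrate the L = λW relationship):

    # Little's Law: L = lam * W
    #   L   = average number of items in the system
    #   lam = average arrival rate (items per unit time)
    #   W   = average time an item spends in the system

    arrival_rate = 50.0    # hypothetical: 50 requests arrive per second
    time_in_system = 0.2   # hypothetical: each request spends 0.2 s in the system

    avg_in_system = arrival_rate * time_in_system
    print(f"Average number of requests in the system: {avg_in_system:.1f}")  # 10.0

    # The law can be rearranged to infer any one quantity from the other two,
    # e.g. estimating average time in system from measured occupancy and rate:
    measured_L = 24.0        # hypothetical: 24 requests in flight on average
    measured_rate = 120.0    # hypothetical: 120 requests arrive per second
    print(f"Implied time in system: {measured_L / measured_rate:.2f} s")      # 0.20 s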