Computer Systems Organization: Lecture 4
Ankur Srivastava
University of Maryland, College Park
Based on slides from Mary Jane Irwin ( www.cse.psu.edu/~mji )
www.cse.psu.edu/~cg431
[Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, UCB]
ENEE350

Performance Metrics
Purchasing perspective: given a collection of machines, which has the
- best performance?
- least cost?
- best cost/performance?
Design perspective: faced with design options, which has the
- best performance improvement?
- least cost?
- best cost/performance?
Both require
- a basis for comparison
- a metric for evaluation
Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors

Defining (Speed) Performance
Normally interested in reducing response time (aka execution time): the time between the start and the completion of a task
- Important to individual users
Thus, to maximize performance, we need to minimize execution time
    performance_X = 1 / execution_time_X
If X is n times faster than Y, then
    performance_X / performance_Y = execution_time_Y / execution_time_X = n
Throughput: the total amount of work done in a given time
- Important to data center managers
Decreasing response time almost always improves throughput

Performance Factors
1. 2. 3.
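The definition of performance as the reciprocal of execution time, and the "n times faster" ratio built on it, can be sketched in a few lines of Python. The function names and the two timings below are hypothetical, chosen only for illustration; they do not come from the lecture.

```python
def performance(execution_time_s: float) -> float:
    """Performance is defined as 1 / execution time."""
    if execution_time_s <= 0:
        raise ValueError("execution time must be positive")
    return 1.0 / execution_time_s


def times_faster(execution_time_x_s: float, execution_time_y_s: float) -> float:
    """Return n such that machine X is n times faster than machine Y.

    n = performance_X / performance_Y = execution_time_Y / execution_time_X
    """
    return performance(execution_time_x_s) / performance(execution_time_y_s)


# Assumed timings (not from the slides): X finishes in 10 s, Y in 15 s.
n = times_faster(10.0, 15.0)
print(f"X is {n:.2f} times faster than Y")
```

Because performance is a reciprocal, the ratio of performances is the inverse of the ratio of execution times, which is why the slower machine's time ends up in the numerator.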
ECE244

Recall: Sequential Systems Need Synchronizing Clocks
A computer is a sequential system and has a clock
Each instruction takes a few clock cycles to execute
    CPU execution time for a program = # CPU clock cycles for a program x clock cycle time
or
    CPU execution time for a program = # CPU clock cycles for a program / clock rate
Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program

Review: Machine Clock Rate
Clock rate (MHz, GHz) is the inverse of clock cycle time (clock period)
    CC = 1 / CR
[Figure: one clock period]
    10 nsec clock cycle  => 100 MHz clock rate
    5 nsec clock cycle   => 200 MHz clock rate
    2 nsec clock cycle   => 500 MHz clock rate
    1 nsec clock cycle   => 1 GHz clock rate
    500 psec clock cycle => 2 GHz clock rate
    250 psec clock cycle => 4 GHz clock rate
    200 psec clock cycle => 5 GHz clock rate

Clock Cycles per Instruction
Not all instructions take the same amount of time to execute (different instructions need different numbers of clock cycles). For example, MUL takes more cycles than ADD
    # CPU clock cycles for a program = # instructions for a program x average clock cycles per instruction
Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute
- A way to compare two different implementations of the same ISA

    Instruction class                 A   B   C
    CPI for this instruction class    1   2   3

THE Performance Equation
Our basic performance equation is then
    CPU time = Instruction_count x CPI x clock_cycle
or
    CPU time = Instruction_count x CPI / clock_rate
These equations separate the three key factors that affect performance
Instruction count: depends on the kind of instructions supported by the architecture.
For example, a multiply operation in C could be represented as a sequence of adds in assembly code, but the number of instructions would be large. Having a dedicated Mul instruction reduces the total number of instructions in the program.
CPI: depends on how complex the instructions are. More complex instructions need more clock cycles to execute. For example, the Mul instruction in MIPS takes more clock cycles than the Add instruction.
Hence, if the compiler and assembler choose more complex instructions, they will increase the CPI but may reduce the total number of instructions.

Computing the Effective CPI
Given a specific computer architecture (MIPS, for instance), each instruction i can be associated with the number of clock cycles Ci that it needs.
Given a C/Java program, the compiler and assembler decide which instructions to use from the available instruction set; this affects both the number of instructions and the CPI.
Let us suppose that they end up choosing ICi instructions of type i.
Then the effective CPI becomes (here n is the total number of instructions)
    Overall effective CPI = [sum over i of (Ci x ICi)] / n
Hence the effective CPI depends on
- the kind of instructions (instruction set) supported by the architecture
- the choice of instructions from this instruction set by the compiler and assembler

Determinants of CPU Performance
    CPU time = Instruction_count x CPI x clock_cycle

                              Instruction_count   CPI   clock_cycle
    Algorithm                        X             X
    Programming language             X             X
    Compiler                         X             X
    ISA (instruction set)            X             X         X
    Processor organization                         X         X
    Technology                                               X

A Simple Example

                                   Freq x CPIi
    Op       Freq   CPIi   Base   Load = 2 cycles   Branch = 1 cycle   Two ALU at once
    ALU      50%    1      .5     .5                .5                 .25
    Load     20%    5      1.0    .4                1.0                1.0
    Store    10%    3      .3     .3                .3                 .3
    Branch   20%    2      .4     .4                .2                 .4
    Sum                    2.2    1.6               2.0                1.95

How much faster would the machine be if a better architecture reduced the average load time to 2 cycles?
    CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster
How does this compare with using branch prediction to shave a cycle off the branch time?
    CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster
What if two ALU instructions could be executed at once?
    CPU time new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster

Another Example
Let us suppose that the ISA has 3 kinds of instructions
    A: CPI = 1    B: CPI = 2    C: CPI = 3
Let the compiler/assembler generate 2 kinds of code
    Code 1: has 2 from A, 1 from B and 2 from C
    Code 2: has 4 from A, 1 from B and 1 from C
Which is better?
    Code 1: total number of clocks = 2x1 + 1x2 + 2x3 = 10
    Code 2: total number of clocks = 4x1 + 1x2 + 1x3 = 9
    Code 1 execution time = 10 x clock cycle
    Code 2 execution time = 9 x clock cycle
Therefore Code 2 is faster even though it has more instructions

SPEC Benchmarks
www.spec.org
Integer benchmarks
    gzip      compression
    vpr       FPGA place & route
    gcc       GNU C compiler
    mcf       Combinatorial optimization
    crafty    Chess program
    parser    Word processing program
    eon       Computer visualization
    perlbmk   perl application
    gap       Group theory interpreter
    vortex    Object oriented database
    bzip2     compression
    twolf     Circuit place & route
FP benchmarks
    wupwise   Quantum chromodynamics
    swim      Shallow water model
    mgrid     Multigrid solver in 3D fields
    applu     Parabolic/elliptic pde
    mesa      3D graphics library
    galgel    Computational fluid dynamics
    art       Image recognition (NN)
    equake    Seismic wave propagation simulation
    facerec   Facial image recognition
    ammp      Computational chemistry
    lucas     Primality testing
    fma3d     Crash simulation fem
    sixtrack  Nuclear physics accel
    apsi      Pollutant distribution

Example SPEC Ratings
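Returning to the two CPI examples worked earlier in this lecture, the arithmetic can be checked with a short Python sketch. The instruction counts, frequencies and per-class cycle counts are the ones from the slides; the function and variable names are hypothetical, introduced here only for illustration.

```python
from typing import Mapping, Tuple

# Clock cycles per instruction class, as given in "Another Example".
CYCLES_PER_CLASS = {"A": 1, "B": 2, "C": 3}


def total_cycles(counts: Mapping[str, int]) -> int:
    """Sum of (Ci x ICi) over all instruction classes."""
    return sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())


def effective_cpi(counts: Mapping[str, int]) -> float:
    """Total clock cycles divided by the total number of instructions n."""
    return total_cycles(counts) / sum(counts.values())


def weighted_cpi(mix: Mapping[str, Tuple[float, float]]) -> float:
    """Effective CPI from (frequency, CPI) pairs, as in "A Simple Example"."""
    return sum(freq * cpi for freq, cpi in mix.values())


# "Another Example": instruction counts per class for the two code sequences.
code_1 = {"A": 2, "B": 1, "C": 2}
code_2 = {"A": 4, "B": 1, "C": 1}
for name, counts in (("Code 1", code_1), ("Code 2", code_2)):
    print(name, "cycles:", total_cycles(counts),
          "effective CPI:", effective_cpi(counts))

# "A Simple Example": (frequency, CPI) per operation type.
base_mix = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}
fast_load_mix = dict(base_mix, Load=(0.2, 2))  # average load time cut to 2 cycles

# With IC and CC unchanged, the CPU time ratio equals the CPI ratio.
ratio = weighted_cpi(base_mix) / weighted_cpi(fast_load_mix)
print(f"Faster loads: {100 * (ratio - 1):.1f}% faster")
```

Code 2 executes more instructions than Code 1 but has a lower effective CPI, so its total cycle count is lower, which matches the slide's conclusion.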