COURSE NOTES: ECE8405 COMPUTER ARCHITECTURE
Spring 2004
Ed Char

1. INTRODUCTION

Summary: Hierarchy as a way of "controlling" complexity; course overview.

Architecture is at the heart of computer engineering.
Computers are driving technology, the economy, entertainment and society.
Advances are coming at a breakneck pace:
  Moore's Law: processors and memory are doubling in performance every 1.5 years!
  (Due to technology AND organization.)
Most nontrivial digital systems are based on computers.
The computer has become the most complicated (centralized) human-made object, with hundreds of millions of identifiable parts.

How can such a complicated system work reliably?
  Hierarchical organization (object-oriented design) makes it possible.
Many different systems have to come together to allow not only functionality, but high PERFORMANCE.
Central is the instruction set - the key component of the hardware-software interface.
This is the set of instructions implemented by the processor, and the bit patterns that encode them.

Let's consider the major component categories of this hierarchy:

  SOFTWARE
    Networking
    Operating systems
    High-level languages and compilers
    Assembly language and assemblers, linkers, loaders
  | INSTRUCTION SET |
  HARDWARE
    Architecture (registers, datapath and ALU)
    Systems design (CPU, memory, I/O, exceptions)
    Gate-level design, simulation
    VLSI design and validation

  ... not to mention ...
    Performance evaluation
    Chip and system packaging
    Clock speeds, CPU size, etc.
    ECONOMICS

Standards define the interface between layers:
  As in OOP, a public definition hides the complexity at a given layer,
  e.g. network protocols, HLL syntax, O.S. calls, the instruction set of a microprocessor, the inputs/outputs of a VLSI circuit module.
  A "simple" interface specification hides the underlying complexity.
  Allows designers at any level to work at a reasonable complexity.
  Permits design changes and migration to different platforms.
  Think of computer engineering from an object-oriented perspective.

To conclude this train of thought, let's consider the design of a new uP.
The design of a new high-performance processor requires:
  Fast time to market, to leapfrog the competition
  Innovation using the newest and fastest VLSI fabrication line
  A new instruction set design, BUT
    instruction-set choices influence performance at all other levels!
    (e.g. the OS and HLL compilers may be inefficient if certain instructions are not included, but hardware implementation of certain instructions may require large VLSI "real estate").
SO IT'S AN OPTIMIZATION PROBLEM AT ALL LEVELS!
What actually happens is that the WHOLE PROJECT (instruction set, compilers, O.S., hardware) is SIMULATED over and over, until the best set of compromises is reached that gives the best performance at reasonable cost. Only then is the processor committed to silicon.
Some processors have worked the first time they were integrated; others have been sold to the public with serious bugs...
Major processors today: 80x86/Pentium/K6, PowerPC, DEC Alpha, MIPS, SPARC, HP

The five basic components of a computer:
  Datapath (Chs. 4-6)
  Control (Chs. 5-6)
  Memory (Ch. 7)
  Input and Output (Ch. 8)

In more detail, what will we cover in this course?
  Performance metrics, because these allow us to determine whether one machine is better than another, or whether a design modification is worthwhile.
  Instruction set design principles and examples (mainly the MIPS RISC)
  MIPS assembly language and mapping 'C' onto MIPS
  ALU implementation concepts (integer and floating point)
  Block-diagram implementation of a basic MIPS processor
  Pipelining the implementation for better performance: approaching one instruction per clock
  Superscalar implementation for better performance: issuing MORE than one instruction per clock!
  Memory hierarchy, including cache and virtual memory
  Multiprocessor systems (time permitting)

Thus, we will look mainly at the hardware aspects of microprocessors, at a block-diagram level.

Why is this stuff important?
  You will learn how today's computers work.
  Performance analysis will allow you to make educated choices when purchasing systems.
  Software design is helped by (and sometimes requires) an understanding of the underlying hardware concepts.

2. Apples and Oranges: The importance of PERFORMANCE

Summary: performance is key. However, it is quite difficult to evaluate. We will present and critique various measures that are in use, both within and between architectures.

Performance measures are key
  They influence numerous purchasing decisions every day!
  They give a measure of usefulness to the purchaser: how long will you have to wait for results? (e.g. finite element analysis)
  Within an architecture, they can evaluate the effectiveness of design changes.

Textbooks would have you believe that people carefully compare the price/performance of different systems. Unlikely...
  Most users know what architecture they want (Wintel, Mac).
  Many then go for the cheapest system they can buy, leading to problems with reliability, compatibility, a bad monitor, inadequate cache memory, software incompatibilities, etc.
  Some go to a reputable dealer and pay more to get some assurances.
  Very few actually shop around much! Why? ...

Performance is difficult to characterize between different architectures
  Every user takes advantage of different capabilities (e.g. integer vs. floating point).
  Instruction sets differ.
  Superscalarity further muddies the picture (how often are 2/4 instructions actually issued?).

Consider a first-order model of a processor:

  [Block diagram: CONTROL, DATAPATH, MEMORY, I/O, all driven by the SYSTEM CLOCK]
  (clock rate in cycles per second = Hertz; clock cycle time = 1/rate, in seconds)

Key features:
  Control determines the path of data through the system.
  Instructions are decoded by control.
  The CLOCK sets the rate at which instructions are:
    1. Fetched
    2. Decoded
    3. Executed

How about clock rate as a performance measure?
  One instruction could take several clock cycles.
  Different machines have different instructions - one machine may require half as many instructions to execute a program as another.
  DRAM reads may take many clocks, so cache memory performance is very important.
  Newer CPUs can issue multiple instructions each clock cycle!

WITHIN an architecture, it is OK to use clock rate as a performance measure:
  Performance(80 MHz) / Performance(60 MHz) = 80 / 60 = 1.33
So we say that the 80 MHz system is 33% (or 1.33 times) faster than the 60 MHz system. (Note that saying one system is X% faster is not the same as saying the other is X% slower.)

Another measure is CPI: average clock cycles per instruction.
  Can be determined by simulation or by timing program execution.
  OK within a family and using the same program:
    CPI = (ExecutionTimeOfProgram x ClockRate) / NumberInstructionsInProgram
    Perf(A)/Perf(B) = ExecutionTime(B)/ExecutionTime(A) = CPI(B)/CPI(A)   (same program, same clock rate)
  Which program should one execute?
  Somewhat poor for comparison between families, although most instruction sets are becoming more similar (RISC).
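These relationships are easy to turn into a few lines of code. Below is a minimal sketch (Python; the measurements are made up for illustration, not taken from any real machine) of computing CPI from an observed run and comparing two machines of the same family on the same program:

    def cpi(exec_time_s, clock_rate_hz, instruction_count):
        # Average clocks per instruction = total clock cycles / instructions executed
        return exec_time_s * clock_rate_hz / instruction_count

    # Hypothetical measurements: two machines running the same 400-million-instruction program
    cpi_a = cpi(exec_time_s=10.0, clock_rate_hz=200e6, instruction_count=400e6)  # 5.0
    cpi_b = cpi(exec_time_s=6.0,  clock_rate_hz=250e6, instruction_count=400e6)  # 3.75
    print(cpi_a, cpi_b)

    # Performance ratio from execution times (always valid for the same program):
    print(10.0 / 6.0)   # ~1.67: machine B is about 1.67 times faster on this program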
How about MIPS: millions of instructions per second?
  Combines clock rate and CPI: MIPS = ClockRate / (CPI x 10^6)
  PEAK MIPS is the fastest rate obtainable (computed from a best-case instruction sequence); it is not a measured MIPS.
  Relative MIPS = (ExecTime_ref / ExecTime_test) x MIPS_ref
    Works only for the reference program and input.
  So again, it depends on the program (or on theory, for peak MIPS).

You, as the user, want the machine to respond as fast as possible, so let's define performance in terms of the execution time of some program on machine X:
  Performance(X) = 1 / ExecutionTime(X)
Then we can compare two different machines X and Y using that program:
  Performance(X) / Performance(Y) = ExecTime(Y) / ExecTime(X) = n
So X is n times faster than Y (if Y is faster, switch X and Y).

Which program do we use? You can measure performance using the program (or algorithm type) which YOU will run, or use an industry-standard program MIX that exercises different machine functions (integer, float, memory access, I/O, etc.). We'll consider benchmarks shortly.

How can we measure execution time? On UNIX machines, it's possible to obtain an estimate using the time command:
  time someCommand args
  Returns: 90.7u 12.9s 2:39 65%
    First value is seconds spent executing user (application) code
    Second value is seconds spent executing (operating) system code
    Third value is total elapsed time as seen by the user
    Fourth value is the % of elapsed time during which the program was executing

Let's consider CPU execution time only (user + sys). Then for a program:
  CPUexecTime = CPUclockCycles x ClockCycleTime
  CPUclockCycles = InstructionsInProgram x AvgClockCyclesPerInstruction
We can improve performance by decreasing the clock cycle time (faster clock rate), or by reducing the number of clock cycles required to execute the program. The latter can be addressed by:
  More "powerful" instructions (maybe) - gives fewer instructions
  Fewer clocks per instruction - more efficient instruction execution
  More instructions running in parallel
  Faster memory access (many cycles are "wasted" waiting for memory) - NOTE: this is often ignored!
  A better compiler - issues fewer instructions
  A more efficient O.S. - executes fewer instructions per system call
  ... and so on!
This gives a hint of where we are going in this course. We will study the techniques that are being used in today's computers to improve performance.

EXAMPLE: We are designing an architecture, and find that in order to reduce the average clocks per instruction from 3.5 to 2.7 (due to an improvement in the ALU), we have to increase the clock cycle time by 25% (i.e. lower the clock rate by 20%). Is this worth it?
  CPUtime = InstructionCount x CPI x ClockCycleTime
So, divide the original by the modified:
  CPUtime(o) / CPUtime(m) = (3.5 x 1.0) / (2.7 x 1.25) = 1.037
So the modified version is almost 4% faster than the original.
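A quick way to sanity-check a trade-off like the one above is to code the CPU performance equation directly. A minimal sketch (Python); the unit values for instruction count and base cycle time are arbitrary, since they cancel in the ratio:

    def cpu_time(instruction_count, cpi, clock_cycle_time):
        # CPU execution time = instruction count x CPI x clock cycle time
        return instruction_count * cpi * clock_cycle_time

    # The ALU-improvement example: CPI drops 3.5 -> 2.7, cycle time grows by 25%
    original = cpu_time(1.0, 3.5, 1.0)
    modified = cpu_time(1.0, 2.7, 1.25)
    print(original / modified)   # ~1.037 -> the modified design is almost 4% faster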
BENCHMARK SUITES - a group of programs used to gauge performance

To use a BENCHMARK suite of programs to allow fairer comparison between different architectures, we need to consider:
  1. Which programs should we use, and
  2. How can we combine them into one performance measure?
Unfortunately, neither of these questions is trivial!

Which programs should we use?
  Every user runs a different WORKLOAD (i.e. set of programs).
  Different programs may have very different performance measures.
  We want to exercise the different capabilities of the computer in one (or two) benchmark(s).
  BEST: you evaluate systems using the exact mix of programs you will run on your computer!

How should they be combined?
  Arithmetic mean - average all execution times on a machine:
    AM = (1/n) x Sum of Time_i
    Problem: longer-running programs wash out the more fleeting ones.
  Geometric mean - independent of running time; each program has the same weight:
    GM = nth root of ( Product of ExecutionTimeRatio_i )
    where the execution time ratio is ExecTime_test / ExecTime_ref,
    i.e. the ratio of the computer under test to a reference machine.
  Neither is intrinsically better: it depends what you want from your benchmark!

Example:
          Test machine   Reference machine
  App1    20 secs        100 secs
  App2    240 secs       120 secs

  Arithmetic mean: Test = 130 sec, Ref = 110 sec
    -> the reference machine looks 18% faster.
  Geometric mean: Test (relative to Ref) = sqrt(0.2 x 2.0) = 0.632; Ref (relative to Test) = sqrt(5.0 x 0.5) = 1.58
    -> now the test machine looks faster (GM ratio 0.632, i.e. about 1.6 times faster)!!

BOTTOM LINE: Be careful how you slice your performance evaluation data.
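The arithmetic-vs-geometric-mean discrepancy above is easy to reproduce. A minimal sketch (Python) that summarizes the two-program example both ways:

    from math import prod

    def arithmetic_mean(times):
        # Plain average of execution times on one machine
        return sum(times) / len(times)

    def geometric_mean_ratio(test_times, ref_times):
        # Geometric mean of per-program execution-time ratios (test relative to reference)
        ratios = [t / r for t, r in zip(test_times, ref_times)]
        return prod(ratios) ** (1.0 / len(ratios))

    test = [20.0, 240.0]    # App1, App2 on the machine under test
    ref  = [100.0, 120.0]   # App1, App2 on the reference machine

    print(arithmetic_mean(test), arithmetic_mean(ref))  # 130.0 vs 110.0 -> ref looks faster
    print(geometric_mean_ratio(test, ref))              # ~0.632 -> test looks ~1.6x faster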
Amdahl's Law

Expecting the improvement of one aspect of a machine to increase performance by an amount proportional to the size of the improvement is wrong. For example: suppose a program runs in 100 seconds on a machine, with multiply instructions responsible for 80 seconds of this time. How much do I have to improve the speed of multiplication if I want my program to run 5 times faster?

The execution time after the improvement is the (sped-up) multiplication time plus the time for the other parts:
  ExecTimeAfterImprovement = 80/n + (100 - 80) seconds
where n is the factor by which multiplication is sped up. Since we want execution to be 5 times faster (20 seconds):
  20 = 80/n + 20
  0 = 80/n
which doesn't work. You need to work on other sections as well to get a five-fold increase in performance.

Example Problems:

1) (2.3) If the clock rates of machines M1 and M2 are 200 MHz and 300 MHz, find the clock cycles per instruction for program 1 given the following:

  Program   Time on M1   Time on M2
  1         10 sec       5 sec
  2         3 sec        4 sec

  Program 1: Instructions on M1 = 200x10^6; Instructions on M2 = 160x10^6

Solution: CPI = Cycles per second / Instructions per second. So:
  M1 = 200x10^6 / (200x10^6 / 10 sec) = 10 cycles per instruction
  M2 = 300x10^6 / (160x10^6 / 5 sec) = 9.4 cycles per instruction

2) (2.10) Consider two different implementations, M1 and M2, of the same instruction set. There are four classes of instructions (A, B, C and D) in the instruction set. M1 has a clock rate of 500 MHz. The average number of cycles for each instruction class on M1 is:

  Class   CPI for this class
  A       1
  B       2
  C       3
  D       4

M2 has a clock rate of 750 MHz. The average number of cycles for each instruction class on M2 is:

  Class   CPI for this class
  A       2
  B       2
  C       4
  D       4

Assume that peak performance is defined as the fastest rate that a machine can execute an instruction sequence chosen to maximize that rate. What are the peak performances of M1 and M2, expressed as instructions per second?

Solution: For M1, peak performance will be achieved with a sequence of class A instructions, which have a CPI of 1. The peak performance is thus 500 MIPS. For M2, a mixture of A and B instructions, both of which have a CPI of 2, will achieve peak performance, which is 750/2 = 375 MIPS.

3) (2.13) Consider two different implementations, M1 and M2, of the same instruction set. There are three classes of instructions (A, B and C) in the instruction set. M1 has a clock rate of 400 MHz and M2 has a clock rate of 200 MHz. The average number of cycles for each instruction class, and how three different compilers use the instruction set, are given below:

  Class   CPI M1   CPI M2   C1 usage   C2 usage   3rd-party usage
  A       4        2        30%        30%        50%
  B       6        4        50%        20%        30%
  C       8        3        20%        50%        20%

C1 is a compiler produced by the makers of M1, C2 is produced by the makers of M2, and the other is a third-party compiler. Assume that each compiler uses the same number of instructions for a given program, but that the instruction mix is as described in the table.

Using C1 on both M1 and M2, how much faster can the makers of M1 claim that M1 is over M2? Using C2, how much faster is M2 over M1? If you purchase M1, which compiler should you use? For M2?

Solution:
Using C1, the CPI on M1 = 4*.3 + 6*.5 + 8*.2 = 5.8, and on M2 = 2*.3 + 4*.5 + 3*.2 = 3.2.
So, assuming M1 is faster, we have (3.2/200E6) / (5.8/400E6) = 1.10, i.e. M1 is 10% faster than M2 using C1.
Using C2, the CPI on M1 = 4*.3 + 6*.2 + 8*.5 = 6.4, and on M2 = 2*.3 + 4*.2 + 3*.5 = 2.9.
Assuming M2 is faster, we have (6.4/400E6) / (2.9/200E6) = 1.10, so now M2 is 10% faster than M1 using C2.
For the 3rd-party compiler: CPI on M1 = 4*.5 + 6*.3 + 8*.2 = 5.4, and on M2 = 2*.5 + 4*.3 + 3*.2 = 2.8.
This tells us that the 3rd-party compiler is better for both machines, since the CPI is lower. If we try M2 as the faster machine we get (5.4/400E6) / (2.8/200E6) = 0.964, so M1 is actually the faster machine. How much faster? Reverse the ratio: (2.8/200E6) / (5.4/400E6) = 1.037, i.e. M1 is 3.7% faster than M2.

4) (2.15) Consider a program P with the following mix of operations:
  floating-point multiply   10%
  floating-point add        15%
  floating-point divide      5%
  integer instructions      70%

Machine MFP has floating-point hardware and can implement the floating-point operations directly. It requires the following number of clock cycles for each instruction class:
  floating-point multiply    6
  floating-point add         4
  floating-point divide     20
  integer instructions       2

Machine MNFP has no floating-point hardware and so must emulate the floating-point operations using integer instructions. The integer instructions take 2 clock cycles. The number of integer instructions needed to implement each of the floating-point operations is:
  floating-point multiply   30
  floating-point add        20
  floating-point divide     50

Both machines have a clock rate of 1000 MHz. Find the native MIPS rating for both machines.

Solution: MIPS = ClockRate / (CPI x 10^6).
  CPI for MFP = .1*6 + .15*4 + .05*20 + .7*2 = 3.6
  CPI for MNFP = 2 (every instruction it executes is an integer instruction)
  MIPS for MFP = 1000/3.6 = 278
  MIPS for MNFP = 1000/2 = 500

5) (2.16) If machine MFP in the last exercise needs 300 million instructions for a program, how many integer instructions does machine MNFP require for the same program?

Solution: This is really just a ratio-type problem:

  Instruction class          Count on MFP (10^6)   Count on MNFP (10^6)
  floating-point multiply    30                    900
  floating-point add         45                    900
  floating-point divide      15                    750
  integer instructions       210                   210
  TOTALS                     300                   2760

6) (2.26) The table below shows the number of floating-point operations executed in two different programs and the runtime for those programs on three different machines (times given in seconds):

  Program   FP ops        Computer A   Computer B   Computer C
  1         10,000,000    1            10           20
  2         100,000,000   1000         100          20

Which machine is fastest given total execution time? How much faster is it than the other two?

Solution: Total execution time for Computer A is 1001 seconds, for Computer B 110 seconds, and for Computer C 40 seconds. So C is the fastest. It is 1001/40 = 25 times faster than A and 110/40 = 2.75 times faster than B.
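The CPI and MIPS calculations in problems 2 through 5 all follow the same pattern: a CPI weighted by the instruction mix, then MIPS = ClockRate / (CPI x 10^6). A minimal sketch (Python) of that pattern, checked against two of the numbers above:

    def weighted_cpi(class_cpi, class_fraction):
        # Average CPI = sum over classes of (CPI of class x fraction of instructions in that class)
        return sum(c * f for c, f in zip(class_cpi, class_fraction))

    def native_mips(clock_rate_hz, avg_cpi):
        # Native MIPS = clock rate / (CPI x 10^6)
        return clock_rate_hz / (avg_cpi * 1e6)

    # Problem 3 (2.13): compiler C1 on M1, classes A, B, C
    print(weighted_cpi([4, 6, 8], [0.30, 0.50, 0.20]))          # 5.8

    # Problem 4 (2.15): machine MFP, classes FP mul, FP add, FP div, integer
    mfp_cpi = weighted_cpi([6, 4, 20, 2], [0.10, 0.15, 0.05, 0.70])
    print(mfp_cpi, native_mips(1000e6, mfp_cpi))                # 3.6, ~278 MIPS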
Homework problems:

1) Using the information from Example 2 above on M1 and M2: if the number of instructions executed in a certain program is divided equally among the classes of instructions, how much faster is M2 than M1? (Hint: find the CPI of each machine first.)

2) (2.18) You are the lead designer of a new processor. The processor design and compiler are complete, and now you must decide whether to produce the current design as it stands or spend additional time to improve it. Your options are:

a) Leave the design as it stands. Call this base machine Mbase. It has a clock rate of 500 MHz, and the following measurements have been made using a simulator:

  Instruction Class   CPI   Frequency
  A                   2     40%
  B                   3     25%
  C                   3     25%
  D                   5     10%

b) Optimize the hardware. The hardware team claims that it can improve the processor design to give it a clock rate of 600 MHz. Call this machine MOPT. The following measurements were made for MOPT:

  Instruction Class   CPI   Frequency
  A                   2     40%
  B                   2     25%
  C                   3     25%
  D                   4     10%

What is the CPI for each machine?

3) (2.21) Using the material for Mbase and MOPT above, a compiler team proposes to improve the compiler for the machine to further enhance performance. Call this combination of the improved compiler and the base machine Mcomp. The instruction count improvements from this enhanced compiler have been estimated as follows:

  Instruction Class   % of instructions executed vs. base machine
  A                   90%
  B                   90%
  C                   85%
  D                   95%

For example, if Mbase executed 500 Class A instructions, Mcomp would execute .9*500 = 450 instructions. What is the CPI for Mcomp?

4) (2.27) Given the material from Exercise 6 (problem 2.26), which machine is fastest using the geometric mean?

5) (2.44) You are going to enhance a machine, and there are two possible improvements: either make multiply instructions run four times faster than before, or make memory access instructions run two times faster than before. You repeatedly run a program that takes 100 seconds to execute. Of this time, 20% is used for multiplications, 50% for memory instructions and 30% for other tasks. What will the speedup be if you improve only multiplication? What will the speedup be if you improve only memory access? What will the speedup be if both improvements are made?
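Finally, a minimal sketch (Python) of the Amdahl's Law calculation from the multiply example earlier in these notes; the same function is useful for reasoning about partial-improvement speedup questions like problem 5 (the numbers below are the lecture example, not the homework):

    def sped_up_time(total_time, affected_fraction, speedup_factor):
        # Amdahl's Law: only the affected part of the execution time is sped up
        affected = total_time * affected_fraction
        return affected / speedup_factor + (total_time - affected)

    # Lecture example: 80 of 100 seconds are multiplies. Even a huge multiplier speedup
    # leaves the other 20 seconds untouched, so the overall speedup approaches but
    # never reaches 5x.
    for n in (2, 10, 100, 1e9):
        t = sped_up_time(100.0, 0.80, n)
        print(n, t, 100.0 / t)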