Computer Architecture • We will use a quantitative approach to analyze architectures and potential improvements and see how well they work (if at all) – We study RISC instruction sets to promote instructionlevel, block-level and thread-level parallelism – Pipelining, superscalar, branch speculation, vector processing, multi-core & parallel processing – Out of order completion architectures – Compiler optimizations – Improving cache performance (and virtual memory performance if time permits) – Early on, we concentrate on the MIPS 5-stage pipeline, later we will also look at other approaches including the Pentium processing Performance Measures • Many different values can be used – MIPS, MegaFLOPS – misleading values – Clock Speed – does not take into account parallelism/pipeline, stalls, cache, etc – Execution time – we use this to compare benchmarks but we have to make sure the benchmarks were run equally (loaded system, unloaded system, etc) – Throughput – number of programs per unit of time, possibly useful for servers – CPU time, user CPU time, system CPU time – CPU performance = 1 / execution time • What does it really mean for 1 computer to be faster than another? – If we use a benchmark suite and 1 computer consistently outperforms the other, this is useful, otherwise we have to take into account the types of programs where one computer was better than the other SPEC 2006 Benchmarks Design Concepts • Take advantage of parallelism – There are many opportunities for parallelism in code • Use multiple hardware components (ALU functional units, register ports, caches/memory modules, disk drive access, etc) • Distribute instructions to hardware components in an overlapped (pipelined) or distributed (parallel) fashion • Use hardware and software approaches • Principle of locality of reference – Design memory systems to support this aspect of program and data access • Focus on the common case – Amdahl’s Law (next slide) demonstrates that minor improvements to the common case is usually more useful than large improvements to rare cases – Find ways to enhance the hardware for common cases over rare cases Amdahl’s Law • When comparing two systems, we view the speedup as – CPU time of old system / CPU time of new system • E.g., old system takes 10.5, new system takes 5.25, speedup = 10.5 / 5.25 = 2 • Speedup of one enhancement – 1 / (1 – F + F / S) • F = fraction of the time the enhancement can be used • S = the speedup of the enhancement itself (that is, how much faster the computer runs when the enhancement is in use) – Example: an integer processor performs FP operations in software routines, a benchmark consists of 14% FP operations, a co-processor could perform all FP operations 4 times faster, if we add the co-processor, our speedup is • 1 / (1 - .14 + .14 / 4) = 1.12, or a 12% speedup • Why does Amdahl’s Law promote the “common case”? – Since we have a reciprocal, the smaller the value, the greater the speedup – The denominator subtracts F from 1 and adds F / S, so –F will have a larger impact than F / S Examples • Web server enhancements: – Faster CPU (10 times faster on computation, 30% CPU operations, 70% I/O) • speedup = 1 / (1 - .3 + .3 / 10) = 1.37 (37% speedup) – More hard disk space that improves I/O performance by 2.5 • speedup = 1 / (1 - .7 + .7 / 2.5) = 1.72 (72% speedup) – Select the common case • A benchmark has 20% FP square root operations, 50% total FP operations, 50% other – Add an FP sqrt unit with a speedup of 10 • speedup = 1 / (1 - .2 + .2 / 10) = 1.22 – Add a new FP ALU with a speedup of 1.6 for all FP ops • speedup = 1 / (1 - .5 + .5 / 1.6) = 1.23 – Again, the common case is the better way to go (slightly) Another Example • Architects have suggested a new feature that can be used 20% of the time and offers a speedup of 3 • One architect though feels that he can provide a better enhancement that will offer a 7 time speedup for that particular feature • What percentage of the time would the second feature have to be used to match the first enhancement? – Speedup from feature 1 = 1 / (1 - .2 + .2 / 3) = 1.154 – For speedup from feature 2 = 1.154, we need to solve for x where 1.154 = 1 / (1 – x + x / 7) • Algebra gives us the following – 1 – x + x / 7 = 1 / 1.154 = .867 – 1 - .867 = x – x / 7 .133 = (7x – x) / 7 = 6x / 7 – 7 * .133 / 6 = x = .156, so the second feature would have to be used 15.6% of the time to offer the same speedup CPU Performance Formulae • Another way to compute speedup is to compute the CPU’s performance before and after some enhancement(s) • We will use the following formulae – CPU time = CPU clock cycles * clock cycle time • CPU clock cycles = number of elapsed clock cycles • CPU clock cycles = instruction count (IC) * clock cycles per instruction (CPI) = IC * CPI – not all instructions will have the same CPI, so we might have to compute this as (S CPIi * ICi) for all classes of instructions i – For instance, we might have loads, stores, branches, ALU (integer) operations, FP operations with CPIs of 5, 4, 3, 2 and 10 respectively • Clock cycle time = 1 / clock rate (we will abbreviate clock cycle time as CCT going forward) – Given two enhancements, compute their CPU exeuction time, speedup of machine 2 over machine 1= • CPU time machine1 / CPU time machine2 Example • Consider that we can either enhance the FP sqrt unit or enhance all FP units – IC breakdown: 25% FP operations, 2% of which are FP square root operations, 75% all other instructions – CPI: 4.0 for FP operations (on average across all FP operations), 20 for FP sqrt, 1.33 for all other instructions • CPI original machine = 25% * 4.0 + 75% * 1.33 = 2.00 – If we enhance all FP units, the overall CPI for FP operations becomes 2.5, if we enhance just the FP sqrt, it reduces to 2.0 • Compute the CPU time of each (note that IC and clock rate (CCT) remain the same) – CPI all FP = 75% * 1.33 + 25% * 2.5 = 1.625 – Speedup enhancing all FP = (IC * 2.00 * CCT) / (IC * 1.625 * CCT) = 1.23 – CPI FP sqrt = CPI original – 2% * (20 – 2) = 1.64 – Speedup enhancing FP sqrt = (IC * 2.00 * CCT) / (IC * 1.64 * CCT) = 1.22 – Enhancing all FP is better by 1.64 / 1.625 = 1.01, or about 1% Another Example • Our current machine has a load-store architecture and we want to know whether we should introduce a register-memory mode for ALU operations – Assume a benchmark of 21% loads, 12% stores, 43% ALU operations and 24% branches – CPI is 2 for all instructions except ALU which is 1 • The new mode will lengthen the ALU CPI of 2, and it also, as a side effect, lengthens Branch CPI to 3 – The IC will be reduced because we need fewer loads, let’s assume this new mode will be used in 25% of all ALU operations • Use the CPU execution time formula to determine the speedup of the new addressing mode Solution • The number of ALU operations that will use this new mode is 25%, or 43% * 25% = 11% – This means that we will have 11% fewer instructions so ICnew = 89% * ICold – Those dropped instructions will all be loads, so we will have a different breakdown of instruction mix • • • • – – – – Loads = (21% - 11%) / 89% = 11% Stores = 12% / 89% = 13% ALU = 43% / 89% = 48% Branches = 24% / 89% = 27% CPIold = 43% * 1 + 57% * 2 = 1.57 CPInew = (48% + 11% + 13%) * 2 + 27% * 3 = 1.89 CPU execution time old = IC * 1.57 * CCT CPU execution time new = .89 * IC * 1.89 * CCT • Speedup = 1.57 / (.89 * 1.89) = .933, which is actually a slowdown! – We would not want to use this enhancement Which Formula? • In a previous example, we solved the problem of FP sqrt or all FP units which we had solved earlier (slide 6) • Which approach should we use? – Depends on what information we are given, notice in using Amdahl’s law, we know the fraction of time an enhancement can be used and how much speedup that enhancement gives us – We can compute the same thing in the CPU time formula • Let’s try another example to prove it – Benchmark consists of 35% loads, 15% stores, 40% ALU and 10% branch operations with a CPI breakdown of 5 (loads/stores), 4 (ALU branches) – Enhancement: since we have separate INT and FP registers and this benchmark does not use the FP registers, can we use a compiler to move values from INT to FP registers and back rather than using the slower loads & stores? Yes. How much speedup will this give us? – Assume the compiler can reduce the loads/stores by 20% because of this enhancement Solution • CPI goes down, IC remains the same, CPU clock time is unchanged – Solution using CPU Time formula • CPIold = 50% * 5 + 50% * 4 = 4.5 – 20% of the loads/stores now become register moves, so our new breakdown of instructions is 40% load/store and 60% ALU • CPInew = 40% * 5 + 60% * 4 = 4.4 • Speedup = (4.5 * IC * CPU clock time) / (4.4 * IC * CPU clock time) = 4.5 / 4.4 = 1.023 or 2.3% speedup – Amdahl’s Law • Speedup of enhancement is 5/4 (5 cycles down to 4) = 1.25 • Fraction the enhancement can be used – the enhancement is used in 20% of the loads/stores which were 50% of the total instruction mix and these instructions took up 5 cycles of time each, so it is used .20 * .50 * 5 / 4.5 (the original CPI) = .111 • Speedup = 1 / (1 - .111 + .111 / 1.25) = 1.023 Instruction Set Principles • We studied instruction set design issues in 362 – Here, we develop a RISC instruction set to be used throughout the course, called MIPS • We want a fixed-length instruction format and a load-store instruction set both of which will help support a pipeline • What other issues should we consider? – Number of operands (2-operand or 3-operand)? – Number of registers and what type (should we differentiate between data and address registers?) • Memory issues – What addressing modes should we allow? – How many bits should we allow for address displacements, for immediate data? Comparisons Addressing Modes • Data will either be – Constants (immediate data) – Stored in registers – Stored in memory • For data stored in memory, there are numerous ways to specify the address – Direct, indirect (pointers), register indirect (pointers in registers), base displacement (sum of displacement and value in register) indexed (sum of values in two registers), etc – see the next slide – Complex modes can impact CPI because of the time it takes to obtain or compute the address • Design issues – How many bits should be allowed for an immediate datum or a displacement? Analysis of SPEC benchmarks indicate no more than 15 bits are needed for a displacement (displacements are < 32K) and 8 bits for most immediate data – Which modes? Again, an analysis of SPEC benchmarks indicate that immediate and displacement modes are most common (see figure A.7) Branch Issues • Branches typically use PC-relative branching – The branch target location is computed as PC PC + offset rather than PC offset, this keeps the offset to fewer bits in the instruction • also, with PC + offset, we do not need to know absolute memory locations at compile time allowing code to be moved in memory during execution • Branches break down into – Conditional branches (test a condition first) – Unconditional branches – Procedure calls and returns (require storing a return address, probably parameter passing as well) • register windows are used in some high performance computers for parameter passing (this is explained in the out-of-class notes) • Conditional branches make up 75-82% of all branches – Distance (offsets) for most branches can be limited to about 8 bits (255 locations) – see figure A.15 on page A-18 Continued • For procedure calls/returns, how is the state saved/restored – Through memory or register windows • What is the form of condition for conditional branches? – Complex conditions can be time consuming – Using condition codes is problematic with pipelines – A simple zero test (value == 0 or value != 0) is the simplest and fastest approach but requires additional instructions • e.g., to compare x == y + 1, do x – y + 1 first, then compare the result to 0 • When is the comparison performed? – With the branch instruction or prior to the branch? Types of Instructions • Arithmetic (integer) and logic operations • FP operations (+, -, *, /) and conversion between int and FP – We separate FP and integer operations for several reasons • they have different CPIs • we will use different register files • we will use different execution units in the pipeline • Data transfer (loads, stores) • Control (conditional, unconditional branches, procedure calls, returns, traps) • I/O • Strings (move, compare, search) • OS operations (OS system calls, virtual memory, other) • Graphics (pixel operations, compression/decompression, others) – In this course we will only concentrate on the first 4 classes although we will briefly consider vector operations as well, which are often used to support graphics – See figure A.13 on page A-16 for a breakdown of the SPECInt92 benchmark programs as executed on the Intel 80x86 architecture Embedded Application Instruction Sets • With RISC, instruction sets were being restricted – – – – Fewer instructions in the instruction set Fixed length instructions Fewer and simpler addressing modes Load-store instruction sets • With the popularity of embedded applications due to handheld devices, new restrictions are being introduced – 16-bit and 32-bit instruction sizes to accommodate narrower buses • Requires smaller memory addresses, smaller immediate data, fewer registers • This also improves cache performance because we can fit more in the caches • An alternative is to use compression on instructions, compress an instruction, fetch it, uncompress it in the CPU – IBM follows this Compiler Optimizations • In order to support the increasingly complex hardware, we need compiler support in the form of machine code optimizations, here are some examples: – High-level optimizations on source code • example: procedure in-lining, loop transformation – Local optimizations on single-lines of code • example: change the order of references in a block or expression – Global optimizations extend local across branches • example: loop unrolling – Register allocation to optimize the storage of variables in registers and minimize memory fetches – Machine-dependent optimizations • take advantage of the specific architecture – see Figure A.19, page A-25 and A.20, page A-28 • Two examples Continued – Sub-expression elimination – assume that a particular expression is used in several expressions, the value can be computed one time and stored in a register, later uses can reference the register and not have to re-compute the same expression – Graph coloring – an algorithm used to determine the best (or a good) allocation of local variables to registers • this is an NP complete problem, so compilers use a graph coloring approximation or heuristic algorithm instead • There is a problem with compiler optimizations: phaseordering – Since compiler optimizations are made in a particular order, one optimization might impact and wipe out the gain by an earlier optimization • consider for performing register allocation is performed near the end of optimization but sub-expression elimination, performed earlier in the process, needs some registers, so the earlier optimization relies on having access to registers which may be re-allocated later! Introduction to MIPS • Developed in 1985, since then, there have been many versions, here, we examine a subset called MIPS64 • RISC architecture designed for pipeline efficiency – optimizing compiler essential to improve efficiency • General-purpose register set and load-store architecture – 32 64-bit general purpose (integer) registers • labeled R0, …, R31 • R0 is always 0 • 8-, 16-, 32-bit values are sign extended to become 64 bit values – 32 64-bit floating point registers • labeled F0, …, F31 where only half the register is used for floats • No explicit character or string types – characters treated as ints, ala C – strings as arrays of ints • Arrays are available, using base-displacement addressing Continued • Two addressing modes used: displacement, immediate – direct addressing can be accomplished by using R0 as the displacement register – register indirect can be accomplished by using a base of 0 – displacements of 12-16 bits and immediate data of 8-16 bits – memory is byte addressable and 64-bit addresses are used • Approximately 100 operations – op code requires 7 bits, however we will reduce this to 6 bits by using one op code for all integer ALU operations – Fixed length 32-bit instructions – 3 instruction formats used (shown on the next slide) • I-type for immediate data, used for loads, stores, conditional branches and ALU operations that have an immediate datum as an operand • R-type for register type, used for all other ALU operations and FP operations • J-type for jump type, used for jump, jump and link (procedure call), trap, return – Immediate data and displacements are limited to 16 bits except for Jump instructions in which case displacements are limited to 26 bits Continued 3 operand instructions are available as long as all operands are in registers (R-type) or 2 registers and immediate datum (I-type) immediate datum (which is also used for displacement offsets) is limited to 16 bits (2’s complement) but extended to 32 bits funct is the specific type of ALU or FP function MIPS Instructions • See Figure A.26, page A-40 for full list, here we look at the instructions we will be using • Loads/Stores – LD, SD – load/store double word (we could also use LW, SW for word sized data movements) • LD R2, 204(R3) – load item from M[204 + R3] in R2 – L.S, L.D, S.S, S.D – load and store single and double precision floats (S = single, D = double) • L.S F3, 0(R5) – note the use of integer register for the base • ALU operations (integer only) – DADD/DSUB, DADDI/DSUBI, DMUL, DDIV – add/subtract/multiply/divide with 3 registers (or 2 registers and an immediate datum for add/subtract) • DADD R1, R2, R3 • DADDI R1, R2, #1 – also similar operations for AND, OR, shift, rotate – SLT, SLTI – set less than – used for comparison • SLT R1, R2, R3 – if R2 < R3, set R1 to 1, else R1 = 0 • Branch operations Continued – BEQZ, BNEZ – branch if register tested is 0/not zero • BEQZ R1, foo – PC = PC + foo if R1 = 0 – J – unconditional jump to given location • J foo – sets PC = PC + foo – JR – unconditional jump where offset is in given register • JR R3 – sets PC = PC + R3 • Floating Point operations – ADD.D, ADD.S, SUB.D, SUB.S, MUL.D, MUL.S, DIV.D, DIV.S • floating point operations, 3 FP registers • ADD.S F1, F2, F3 – C.__.D, C.__.S – FP comparisons, __ is LT, GT, LE, GE, EQ, NE – CVT.__.__ - converts from one type to another using two registers • CVT.D.L F2, R4 – convert double in F2 to long in R4 MIPS 5-Stage Architecture See section C.3 pages C-31-C-34 • IF: – PC sent to instruction cache – PC incremented by 4, stored in NPC temporarily IF & ID Stages • a MUX in the MEM stage determines if the PC should get the value in NPC or the value computed in EX – Instruction stored in IR • ID: – Bits 6..10 denote one source register (I-type and R-type instructions) – Bits 11..15 denote one source register (R-type) – Bits 16..32 store immediate datum or displacement, sign extended to 32 bits • NPC, A, B and IMM are temporary registers used in later stages • This stage EX Stage – computes ALU operations • using register A & B or A& IMM, result from ALU placed in ALU output register and passed on to next stage – computes effective addresses for loads and stores • A + IMM, stored in ALU output and passed onto next stage – computes branch target locations and performs the zero test to determine if a branch is taken or not • A zero tested • ALU adds PC + IMM, value sent to ALU output and passed to next stage • MEM: MEM and WB Stages – If load, ALU output stores address, sent to data cache, resulting datum stored in LMD – If store, ALU output stores address and B register stores datum, both are sent to data cache – If branch, based on condition, MUX either selects NPC or branch target location (as computed in the ALU EX) to send back to PC – If ALU, forward result from ALU output directly to LMD • WB: – If a datum in LMD (load or ALU), store in the register file Comments on the MIPS Architecture • The simplified nature of MIPS means that many tasks will require more than a single assembly/machine operation to complete – in CISC instruction sets, some operations can be done in 1 instruction, such as indirect addressing and compare-andbranch operations – registers must be pre-loaded with the data before performing an ALU operation – two or more instructions to perform scaled or indexed modes • The CPI of MIPS operations is less than those in other instruction sets making up for this – all operations have a CPI of 4 except Loads and ALU operations which have a CPI of 5 (because they must write their results to registers in the WB stage) • The static size of all MIPS operations makes it easier to deal with pre-fetching and pipelining Continued • The architecture requires the following hardware elements to implement: – the ALU should have all integer operations (arithmetic, logic) • we address floating point operations later in the semester – an additional adder is needed for the IF stage (PC increment) – several temporary registers are needed • IR, A, B, Imm, NPC, ALUOutput, LMD – multiplexors to select the following • what to do after a condition is evaluated • whether a computed value is to be used later in temporary registers A or B • whether to use a register value or the immediate datum • multiplexors in the ALU to select the output based on the specific ALU operation (not shown in the figure) • multiplexors in the register file to select which register to send on to the A or B temporary registers, and a demultiplexor to pass along the LMD value into one of the registers (not shown in the figure) MIPS Code Example • Write a set of code to compute the average of the elements in an int array, assuming the array starts at memory location 50000 and that the number of elements in the array is stored at location 10000 • Store the resulting float value at location 10004 DADDI DADDI LW CVT.W.S Loop: BEQZ LW DADD DADDI DSUBI J Out: CVT.W.S DIV.S S.S R1, R0, #50000 R2, R0, #0 R3, 10000(R0) F2, R3 R3, Out R4, 0(R1) R2, R2, R4 R1, R1, #4 R3, R3, #1 Loop F1, R2 F3, F1, F2 F3, 10004(R0) // R1 is our array index // R2 is our sum // R3 is our loop counter // copy number of elements into F2 as a float // If R3 = = 0, then exit loop // R4 is the next array element // convert sum to floating point // store resulting average Another Example • Write a set of code that will find the largest and smallest items in an array, the array’s starting location is stored in R5 and the array contains 500 elements, store the min in R1, the max in R2 Loop: SetMin: SetMax: Out: LW LW DADDI BEQZ DADDI LW DSUBI SLT BNEZ SLT BNEZ J DADDI J DADDI J … R1, 0(R5) R2, 0(R5) R3, R0, #500 R3, Out R5, R5, #4 R4, 0(R5) R3, R3, #1 R6, R4, R1 R6, SetMin R6, R2, R4 R6, SetMax Loop R1, R4, #0 Loop R2, R4, #0 Loop // R1 is min // R2 is max // R3 is our loop counter // reset array pointer to next element // load next array element // if R4 < R1, set R6 (new min to take care of) // if R2 < R4, set R6 (new max to take care of)