Princess Sumaya University for Technology
Computer Architecture
Dr. Esam Al-Qaralleh

Review

The Von Neumann Machine, 1945
The Von Neumann model consists of five major components: input unit, output unit, ALU, memory unit, and control unit. Execution is sequential.

Von Neumann Model
A refinement of the Von Neumann model, the system bus model has a CPU (ALU and control), memory, and an input/output unit. Communication among components is handled by a shared pathway called the system bus, which is made up of the data bus, the address bus, and the control bus. There is also a power bus, and some architectures may also have a separate I/O bus.

Performance
Both hardware and software affect performance:
- The algorithm determines the number of source-level statements
- The language, compiler, and architecture determine the machine instructions
- The processor and memory determine how fast instructions are executed

Computer Architecture
Instruction Set Architecture (ISA) refers to the actual programmer-visible machine interface, such as the instruction set, registers, memory organization, and exception handling. There are two main approaches: RISC and CISC architectures.

Applications Change over Time
- Data sets and memory requirements grow larger: cache and memory architecture become more critical
- Standalone → networked: I/O integration and system software become more critical
- Single task → multiple tasks: parallel architectures become critical
- Limited I/O requirements → rich I/O requirements:
  60s: tapes and punch cards
  70s: character-oriented displays
  80s: video displays, audio, hard disks
  90s: 3D graphics, networking, high-quality audio
  00s: real-time video, immersion, ...

Application Properties to Exploit in Computer Design
- Locality in memory/I/O references: programs work on a subset of instructions/data at any point in time, with both spatial and temporal locality
- Parallelism:
  Data-level (DLP): the same operation on every element of a data sequence
  Instruction-level (ILP): independent instructions within a sequential program
  Thread-level (TLP): parallel tasks within one program
  Multi-programming: independent programs
- Pipelining
- Predictability: control-flow direction, memory references, data values

Levels of Machines
There are a number of levels in a computer, from the user level down to the transistor level.

How Do the Pieces Fit Together?
Application
Operating system
Compiler / Firmware
Instruction Set Architecture
Instruction set processor / I/O system
Datapath & control
Digital design
Circuit design

Instruction Set Architecture (ISA)
- Complex Instruction Set (CISC): single instructions for complex tasks (string search, block move, FFT, etc.); usually variable-length instructions; registers have specialized functions
- Reduced Instruction Set (RISC): instructions for simple operations only; usually fixed-length instructions; large orthogonal register sets

RISC Architecture
RISC designers focused on two critical performance techniques in computer design:
- the exploitation of instruction-level parallelism, first through pipelining and later through multiple instruction issue;
- the use of caches, first in simple forms and later using sophisticated organizations and optimizations.
RISC ISA Characteristics
- All operations on data apply to data in registers and typically change the entire register.
- The only operations that affect memory are loads and stores, which move data from memory to a register or from a register to memory, respectively.
- A small number of memory addressing modes.
- The instruction formats are few in number, with all instructions typically being one size.
- A large number of registers.
These simple properties lead to dramatic simplifications in the implementation of advanced pipelining techniques, which is why RISC instruction sets were designed this way.

Performance and Cost

Computer Designers and Chip Costs
The computer designer affects die size, and hence cost, both by what functions are included on or excluded from the die and by the number of I/O pins.

Measuring and Reporting Performance
- Time to do the task: execution time, response time, latency
- Tasks per day, hour, second, ns, ...: performance, throughput, bandwidth
Response time is the time between the start and the completion of a task. Thus, to maximize performance, we need to minimize execution time:
  performance_X = 1 / execution_time_X
If X is n times faster than Y, then
  performance_X / performance_Y = execution_time_Y / execution_time_X = n
Throughput is the total amount of work done in a given time; it is important to data center managers. Decreasing response time almost always improves throughput.

Calculating CPU Performance
We want to distinguish elapsed time from the time spent on our task. CPU execution time (CPU time) is the time the CPU spends working on a task; it does not include time waiting for I/O or running other programs.
  CPU time = CPU clock cycles for a program x clock cycle time
           = CPU clock cycles for a program / clock rate
We can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program.

Calculating CPU Performance (Cont.)
We tend to count instructions executed = IC. Note that looking at the object code is just a start; what we care about is the dynamic count - e.g., don't forget loops, recursion, branches, etc. CPI (clock cycles per instruction) is a figure of merit:
  CPI = CPU clock cycles for a program / IC
  CPU time = IC x CPI x clock cycle time = IC x CPI / clock rate

Calculating CPU Performance (Cont.)
Three focus factors: cycle time, CPI, and IC. Sadly, they are interdependent, and making one better often makes another worse (but with small or predictable impacts):
- Cycle time depends on HW technology and organization
- CPI depends on organization (pipeline, caching, ...) and ISA
- IC depends on ISA and compiler technology
Often CPIs are easier to deal with on a per-instruction basis:
  CPU clock cycles = sum over i of (CPI_i x IC_i)
  Overall CPI = sum over i of (CPI_i x IC_i) / instruction count = sum over i of CPI_i x (IC_i / instruction count)

Example of Computing CPU Time
If a computer has a clock rate of 50 MHz, how long does it take to execute a program with 1,000 instructions, if the CPI for the program is 3.5? Using the equation
  CPU time = instruction count x CPI / clock rate
gives CPU time = 1000 x 3.5 / (50 x 10^6) = 70 microseconds.
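The CPU time equation is easy to turn into code. Below is a minimal C sketch (the helper name and the use of doubles are ours) that reproduces the 50 MHz example:

```c
#include <stdio.h>

/* CPU time = IC x CPI / clock rate */
static double cpu_time(double ic, double cpi, double clock_rate_hz) {
    return ic * cpi / clock_rate_hz;
}

int main(void) {
    /* Example from the slides: 1,000 instructions, CPI 3.5, 50 MHz clock */
    double t = cpu_time(1000.0, 3.5, 50e6);
    printf("CPU time = %g s (= %g us)\n", t, t * 1e6); /* 7e-05 s = 70 us */
    return 0;
}
```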
If a computer's clock rate increases from 200 MHz to 250 MHz and the other factors remain the same, how many times faster will the computer be?
  CPU time old / CPU time new = clock rate new / clock rate old = 250 MHz / 200 MHz = 1.25

Evaluating ISAs
- Design-time metrics: can it be implemented, in how long, at what cost? Can it be programmed? Ease of compilation?
- Static metrics: how many bytes does the program occupy in memory?
- Dynamic metrics: how many instructions are executed? How many bytes does the processor fetch to execute the program? How many clocks are required per instruction (CPI)?
- Best metric: time to execute the program!
Instruction count, CPI, and cycle time together determine execution time; instruction count depends on the instruction set, the processor organization, and compilation techniques.

Quantitative Principles of Computer Design

Amdahl's Law
Amdahl's Law defines the speedup gained from a particular feature. It depends on two factors:
- the fraction of the original computation time that can take advantage of the enhancement, i.e., the commonality of the feature
- the level of improvement gained by the feature
Amdahl's Law is a quantification of the diminishing-returns principle.

Amdahl's Law (Cont.)
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then:
  Speedup = ExTime_old / ExTime_new = 1 / ((1 - F) + F/S)

Simple Example
An important application - and note that Amdahl's Law says nothing about cost. Suppose FPSQRT accounts for 20% of execution time, FP instructions for 50%, and other instructions for 30%. Designers say it costs the same to speed up FPSQRT by 40x, FP by 2x, or Other by 8x. Which one should you invest in? Straightforward: plug in the numbers and compare. But what's your guess?

And the Winner Is...?
(The three options are worked out in the short program below.)
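As a check on the FPSQRT/FP/Other question above, here is a small C sketch of Amdahl's Law (the helper name is ours; the fractions and speedups are the slide's):

```c
#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - f) + f/s) */
static double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    printf("FPSQRT 40x: %.3f\n", amdahl(0.20, 40.0)); /* ~1.242 */
    printf("FP      2x: %.3f\n", amdahl(0.50, 2.0));  /* ~1.333 */
    printf("Other   8x: %.3f\n", amdahl(0.30, 8.0));  /* ~1.356 <- the winner */
    return 0;
}
```

Speeding up the 30% "Other" fraction by 8x gives the largest overall speedup, despite the much larger 40x factor on FPSQRT.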
Example of Amdahl's Law
Floating-point instructions are improved to run twice as fast, but only 10% of the time was spent on these instructions originally. How much faster is the new machine?
  Speedup = ExTime_old / ExTime_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
  Speedup = 1 / ((1 - 0.1) + 0.1/2) = 1.053
The new machine is 1.053 times as fast, or 5.3% faster. How much faster would the new machine be if floating-point instructions became 100 times faster?
  Speedup = 1 / ((1 - 0.1) + 0.1/100) = 1.109

Estimating Performance Improvements
Assume a processor currently requires 10 seconds to execute a program, and processor performance improves by 50 percent per year.
- By what factor does processor performance improve in 5 years? (1 + 0.5)^5 = 7.59
- How long will it take the processor to execute the program after 5 years? ExTime_new = 10 / 7.59 = 1.32 seconds

Performance Example
Computers M1 and M2 are two implementations of the same instruction set. M1 has a clock rate of 50 MHz and M2 has a clock rate of 75 MHz. M1 has a CPI of 2.8 and M2 has a CPI of 3.2 for a given program.
- How many times faster is M2 than M1 for this program?
  ExTime_M1 / ExTime_M2 = (IC x 2.8 / 50 MHz) / (IC x 3.2 / 75 MHz) = (2.8/50) / (3.2/75) = 1.31
- What would the clock rate of M1 have to be for the two machines to have the same execution time?
  IC x 2.8 / clock rate M1 = IC x 3.2 / 75 MHz, so clock rate M1 = 2.8 x 75 / 3.2 = 65.6 MHz

Simple Example
Suppose we have made the following measurements:
- Frequency of FP operations (other than FPSQR) = 25%
- Average CPI of FP operations = 4.0
- Average CPI of other instructions = 1.33
- Frequency of FPSQR = 2%
- CPI of FPSQR = 20
Two design alternatives: reduce the CPI of FPSQR to 2, or reduce the average CPI of all FP operations to 2.

And The Winner Is...
  CPI_original = sum over i of CPI_i x (IC_i / instruction count) = (4 x 25%) + (1.33 x 75%) = 2.0
  CPI_with_new_FPSQR = CPI_original - 2% x (CPI_old_FPSQR - CPI_new_FPSQR) = 2.0 - 2% x (20 - 2) = 1.64
  CPI_new_FP = (75% x 1.33) + (25% x 2.0) = 1.5
Reducing the average CPI of all FP operations to 2 gives the lower overall CPI, so it is the better alternative.

Instruction Set Architecture (ISA)

Outline
Introduction; classifying instruction set architectures; instruction set measurements; memory addressing; addressing modes for signal processing; type and size of operands; operations in the instruction set; operations for media and signal processing; instructions for control flow; encoding an instruction set; the MIPS architecture.

Instruction Set Principles and Examples

Basic Issues in Instruction Set Design
- What operations, and how many? Load/store/increment/branch are sufficient to do any computation, but not useful (programs would be too long!).
- How many operands are specified? Most operations are dyadic (e.g., A ← B + C); some are monadic (e.g., A ← -B).
- How are operations encoded into instruction formats? Instructions should be multiples of bytes.
A typical instruction set uses a 32-bit word; basic operand addresses are 32 bits long; basic operands (like integers) are 32 bits long. In general, an instruction may refer to 3 operands (A ← B + C). The challenge: encode operations in a small number of bits.

Brief Introduction to ISA
An instruction set architecture is a set of instructions, each directly executed by the CPU's hardware. An instruction is represented by a binary format, since the hardware understands only bits; for example, a 32-bit instruction might be split into fields:
  opcode (6 bits) | rs (5 bits) | rt (5 bits) | immediate (16 bits)
Format options: fixed or variable length. Fixed: each instruction is encoded in a same-size field (typically one word). Variable: half-word, whole-word, and multiple-word instructions are possible.
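To make the fixed-format idea concrete, here is a hedged C sketch (ours, assuming the 6/5/5/16 field split shown above) that extracts the fields of a 32-bit instruction word with shifts and masks:

```c
#include <stdint.h>
#include <stdio.h>

/* Field extraction for a 6/5/5/16 fixed format (illustrative layout,
   matching the opcode|rs|rt|immediate split shown above). */
static void decode(uint32_t insn) {
    uint32_t opcode = (insn >> 26) & 0x3F;        /* bits 31..26 */
    uint32_t rs     = (insn >> 21) & 0x1F;        /* bits 25..21 */
    uint32_t rt     = (insn >> 16) & 0x1F;        /* bits 20..16 */
    int16_t  imm    = (int16_t)(insn & 0xFFFF);   /* bits 15..0, sign-extended */
    printf("opcode=0x%x rs=%u rt=%u imm=%d\n", opcode, rs, rt, imm);
}

int main(void) {
    decode(0x8C490014); /* a load-word-style pattern: opcode 0x23, rs 2, rt 9, imm 20 */
    return 0;
}
```

Because every instruction uses the same field boundaries, this decode is a handful of constant shifts and masks - exactly the decoding simplicity fixed formats buy.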
What Must Be Specified?
- Instruction format (encoding): how is it decoded?
- Location of operands and result: where other than memory? How many explicit operands? How are memory operands located?
- Data type and size
- Operations: what is supported?

Classifying Instruction Set Architectures

Instruction Set Design
  CPU time = IC x CPI x cycle time
The instruction set influences everything.

Instruction Characteristics
An instruction is usually a simple operation, identified by the opcode field. But operations require operands - 0, 1, or 2 - and to identify where they are, they must be addressed; an address refers to some piece of storage (typically main memory, registers, or a stack). There are two options, explicit or implicit addressing:
- Implicit: the opcode implies the addresses of the operands. ADD on a stack machine pops the top 2 elements of the stack, then pushes the result; HP calculators work this way.
- Explicit: the address is specified in some field of the instruction. Note the potential for 3 addresses: 2 operands + the destination.

Operand Locations for Four ISA Classes
[Figure: operand locations for stack, accumulator, register-memory, and load-store architectures]

C = A + B in the Four Classes
- Stack:
  PUSH A
  PUSH B
  ADD      ; pop the top two values of the stack (A, B), push the result
  POP C
- Accumulator (AC):
  LOAD A
  ADD B    ; add AC (holding A) to B, leaving the result in AC
  STORE C
- Register (register-memory):
  LOAD R1, A
  ADD R3, R1, B
  STORE R3, C
- Register (load-store):
  LOAD R1, A
  LOAD R2, B
  ADD R3, R1, R2
  STORE R3, C

Modern Choice - Load-Store Register (GPR) Architecture
Reasons for choosing a general-purpose register (GPR) architecture:
- Registers are faster than memory (on which stacks and accumulators depend).
- Registers are easier and more effective for a compiler to use. For example, (A+B) - (C*D) - (E*F) may be evaluated in any order (for pipelining concerns or otherwise), but on a stack machine it must be evaluated left to right.
- Registers can be used to hold variables, which reduces memory traffic, speeds up programs, and improves code density (fewer bits are needed to name a register).
Compiler writers prefer that all registers be equivalent and unreserved. The number of GPRs: at least 16.

Memory Addressing

Memory Addressing Basics
All architectures must address memory. What is accessed - a byte, a word, multiple words? Today's machines are byte-addressable, and main memory is organized in 32-64 byte lines. Addressing may be big-endian or little-endian. Hence there is a natural alignment problem: an object of size s bytes at byte address A is aligned if A mod s = 0, and a misaligned access takes multiple aligned memory references. The memory addressing mode influences instruction count (IC) and clock cycles per instruction (CPI).

Big-Endian and Little-Endian Assignments
Big-endian: lower byte addresses are used for the most significant bytes of the word. Little-endian: the opposite ordering - lower byte addresses are used for the less significant bytes of the word. With 4-byte words, the word at address 0 holds bytes 0, 1, 2, 3 in big-endian order and 3, 2, 1, 0 in little-endian order; the word at address 4 holds bytes 4, 5, 6, 7 and 7, 6, 5, 4, respectively, and so on up to the last word of memory.
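Endianness is easy to observe from C. A minimal sketch (ours) that inspects the first byte of a multi-byte integer:

```c
#include <stdio.h>

int main(void) {
    unsigned int word = 0x01020304;
    unsigned char *p = (unsigned char *)&word; /* view the same word as bytes */
    /* Little-endian machines store the least significant byte (0x04) at the
       lowest address; big-endian machines store the most significant (0x01). */
    if (p[0] == 0x04)
        printf("little-endian\n");
    else if (p[0] == 0x01)
        printf("big-endian\n");
    return 0;
}
```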
Addressing Modes
- Immediate: ADD R4, #3          Regs[R4] ← Regs[R4] + 3
- Register: ADD R4, R3           Regs[R4] ← Regs[R4] + Regs[R3]
- Register indirect: ADD R4, (R1)   Regs[R4] ← Regs[R4] + Mem[Regs[R1]]

Addressing Modes (Cont.)
- Direct: ADD R4, (1001)         Regs[R4] ← Regs[R4] + Mem[1001]
- Memory indirect: ADD R4, @(R3)    Regs[R4] ← Regs[R4] + Mem[Mem[Regs[R3]]]

Addressing Modes (Cont.)
- Displacement: ADD R4, 100(R1)     Regs[R4] ← Regs[R4] + Mem[100 + Regs[R1]]
- Scaled: ADD R1, 100(R2)[R3]       Regs[R1] ← Regs[R1] + Mem[100 + Regs[R2] + Regs[R3] x d]

Typical Address Modes (I)
[Figure]

Typical Address Modes (II)
[Figure]

Operand Type and Size
Typical types (assume word = 32 bits):
- Character: byte - ASCII or EBCDIC (IBM) - 4 per word
- Short integer: 2 bytes, two's complement
- Integer: one word, two's complement
- Float: one word, usually IEEE 754 these days
- Double-precision float: 2 words, IEEE 754
- BCD or packed decimal: 4-bit values packed 8 per word

ALU Operations

What Operations Are Needed
- Arithmetic and logical. Integer arithmetic: ADD, SUB, MULT, DIV, SHIFT. Logical operations: AND, OR, XOR, NOT
- Data transfer: copy, load, store
- Control: branch, jump, call, return, trap
- System: OS and memory management (we'll ignore these for now, but remember they are needed)
- Floating point: same as arithmetic but usually on bigger operands
- Decimal
- String: move, compare, search
- Graphics: pixel and vertex operations, compression/decompression

Top 10 Instructions for the 80x86
load: 22%; conditional branch: 20%; compare: 16%; store: 12%; add: 8%; and: 6%; sub: 5%; move register-register: 4%; call: 1%; return: 1%
The most widely executed instructions are the simple operations of an instruction set. The top 10 instructions for the 80x86 account for 96% of instructions executed. Make them fast, as they are the common case.

Control Instructions Are a Big Deal
- Jumps: unconditional transfer
- Conditional branches: how is the condition set - by a flag or as part of the instruction? How is the target specified, and how far away is it?
- Calls: how is the target specified, and how far away is it? Where is the return address kept? How are the arguments passed? Callee-save vs. caller-save!
- Returns: where is the return address? How far away is it? How are the results passed?
(The C sketch after this list shows where each category shows up in ordinary source code.)
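Each control-transfer category above has a direct source-level counterpart. A small illustrative C sketch (ours; the instruction names in the comments are the typical ones, not the output of a particular compiler):

```c
#include <stdio.h>

static int square(int x) {         /* 'return' -> return instruction */
    return x * x;
}

int main(void) {
    int sum = 0;
    for (int i = 0; i < 10; i++) { /* loop back-edge -> conditional branch */
        if (i % 2 == 0)            /* if -> compare + conditional branch */
            sum += square(i);      /* function call -> call instruction */
    }
    goto done;                     /* goto -> unconditional jump */
done:
    printf("sum = %d\n", sum);     /* 0 + 4 + 16 + 36 + 64 = 120 */
    return 0;
}
```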
Branch Address Specification
The target is known at compile time for unconditional and conditional branches, and hence can be specified in the instruction, either as a register containing the target address or as a PC-relative offset. Considering word-length addresses, registers, and instructions:
- Full address desired? Then pick the register option - BUT the setup and effective-address computation will take longer.
- If you can deal with a smaller offset, then PC-relative works. PC-relative addressing is also position-independent, which keeps the linker's job simple.

Returns and Indirect Jumps
Here the branch target is not known at compile time, so we need a way to specify the target dynamically: use a register, or permit any addressing mode, e.g.
  target ← Mem[Regs[R1]]
Indirect jumps are also useful for case/switch statements, dynamically shared libraries, and high-order functions or function pointers (see the sketch below).
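A register-indirect jump is exactly what a C function-pointer call needs, since the callee is not known until run time. A minimal sketch (ours):

```c
#include <stdio.h>

static int add(int a, int b) { return a + b; }
static int sub(int a, int b) { return a - b; }

int main(void) {
    /* The table entry is loaded into a register and called through it
       (an indirect jump); the target cannot be fixed at compile time. */
    int (*ops[2])(int, int) = { add, sub };
    int which = 1;                      /* imagine this comes from input */
    printf("%d\n", ops[which](10, 4));  /* prints 6 */
    return 0;
}
```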
Encoding an Instruction Set

Encoding the ISA
Encoding turns instructions into a binary representation for execution by the CPU. You can pick anything, but the encoding:
- affects the size of code, so it should be tight;
- affects the CPU design - in particular the instruction decode - so it may have a big influence on the CPI or cycle time.
It must balance several competing forces: the desire for lots of addressing modes and registers, the desire to make average program size compact, and the desire to have instructions encoded into lengths that are easy to handle in a pipelined implementation (multiples of bytes).

3 Popular Encoding Choices
- Variable (compact code, but difficult to decode): the primary opcode is fixed in size, but opcode modifiers may exist; the opcode specifies the number of arguments, each used as an address field. Best when there are many addressing modes and operations. It uses as few bits as possible, but individual instructions can vary widely in length; e.g., on the VAX, versions of integer ADD vary between 3 and 19 bytes.
- Fixed (easy to decode, but lengthy code): every instruction looks the same, though some fields may be interpreted differently; the operation and the addressing mode are combined in the opcode. E.g., all modern RISC machines.
- Hybrid: a set of fixed formats, e.g., the IBM 360 and the Intel 80x86.
The trade-off is between the size of the program and the ease of decoding.

3 Popular Encoding Choices (Cont.)
[Figure: variable, fixed, and hybrid instruction formats]

An Example of Variable Encoding - VAX
addl3 r1, 737(r2), (r3): a 32-bit integer add instruction with 3 operands, needing 6 bytes to represent it:
- Opcode for addl3: 1 byte
- A VAX address specifier is 1 byte (4 bits: addressing mode; 4 bits: register)
- r1: 1 byte (register addressing mode + r1)
- 737(r2): 1 byte for the address specifier (displacement addressing + r2), plus 2 bytes for the displacement 737
- (r3): 1 byte for the address specifier (register indirect + r3)
The length of VAX instructions ranges from 1 to 53 bytes.

Short Summary - Encoding the Instruction Set
The choice is between variable and fixed instruction encoding: if code size matters more than performance, choose variable encoding; if performance matters more than code size, choose fixed encoding.

Role of Compilers
Critical goals in an ISA from the compiler viewpoint: what features will lead to high-quality code, and what makes it easy to write efficient compilers for the architecture.

Compiler and ISA
ISA decisions are no longer made to ease assembly language programming. Because of high-level languages, the ISA is a compiler target today, and the performance of a computer is significantly affected by the compiler. Understanding compiler technology is therefore critical to designing and efficiently implementing an instruction set. The architecture choice affects the code quality and the complexity of building a compiler for it.

Optimization Observations
It is hard to reduce branches. The biggest reduction is often in memory references; some ALU operation reduction happens, but it is usually a few percent. Implication: branch, call, and return become a larger relative percentage of the instruction mix, and control instructions are among the hardest to speed up.

The MIPS Architecture

MIPS Instruction Format
The addressing mode is encoded in the opcode. All instructions are 32 bits with a 6-bit primary opcode.

MIPS Instruction Format (Cont.)
I-type instruction:
  opcode (6) | rs (5) | rt (5) | immediate (16)
- Loads and stores: LW R1, 30(R2); S.S F0, 40(R4)
- ALU ops on immediates: DADDIU R1, R2, #3; rt ← rs op immediate
- Conditional branches: BEQZ R3, offset; rs is the register checked, rt is unused, and the immediate specifies the offset
- Jump register, jump and link register: JR R3; rs is the target register, rt and the immediate are unused

MIPS Instruction Format (Cont.)
R-type instruction:
  opcode (6) | rs (5) | rt (5) | rd (5) | shamt (5) | func (6)
- Register-register ALU operations: rd ← rs funct rt, e.g. DADDU R1, R2, R3; the func field encodes the datapath operation: add, sub, ...
- Also used for reads/writes of special registers and for moves
J-type instruction: jump, jump and link, trap, and return from exception:
  opcode (6) | offset (26)
The offset is added to the PC.

Datapath

The Processor: Datapath and Control
[Figure: the PC and instruction memory feed register numbers to a register bank, whose outputs flow through the ALU to the data memory and back]
There are two types of functional units: elements that operate on data values (combinational) and elements that contain state (state elements).

Single-Cycle Implementation
[Figure: state element 1 → combinational logic → state element 2, all within one clock cycle]
Typical execution: read the contents of some state elements, send the values through some combinational logic, and write the results to one or more state elements. A clock signal is used for synchronization (edge-triggered methodology).

A portion of the datapath used for fetching instructions
[Figure]

The datapath for R-type instructions
[Figure]

The datapath for load and store instructions
[Figure]

The datapath for branch instructions
[Figure]

Complete Datapath
[Figure]

Control
Control selects the operations to perform (ALU function, read/write, etc.) and steers the flow of data (multiplexor inputs). The information comes from the 32 bits of the instruction. Example: lw $1, 100($2). The value of the control signals depends on what instruction is being executed and which step is being performed.

Datapath with Control
[Figure]

Single-Cycle Implementation
Calculate the cycle time assuming negligible delays except: memory (2 ns), ALU and adders (2 ns), register file access (1 ns).

Pipelining

Basic Datapath
What do we need to add to actually split the datapath into stages?

Pipelined Datapath
[Figure]

The Five Stages of the Load Instruction
A load passes through five stages, one per cycle:
- Ifetch: fetch the instruction from the instruction memory
- Reg/Dec: register fetch and instruction decode
- Exec: calculate the memory address
- Mem: read the data from the data memory
- Wr: write the data back to the register file

Pipelined Execution
[Figure: successive instructions overlapped in time, each flowing through IFetch, Dcd, Exec, Mem, WB one cycle behind its predecessor]
On a pipelined processor, multiple instructions are in various stages at the same time. Assume each instruction takes five cycles.

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure]

Graphically Representing Pipelines
Pipeline diagrams can help with answering questions like: How many cycles does it take to execute this code? What is the ALU doing during cycle 4? Are two instructions trying to use the same resource at the same time?

Why Pipeline? Because the resources are there!

Why Pipeline?
Suppose 100 instructions are executed. The single-cycle machine has a cycle time of 45 ns; the multicycle and pipelined machines have cycle times of 10 ns; the multicycle machine has a CPI of 4.6.
- Single-cycle machine: 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
- Multicycle machine: 10 ns/cycle x 4.6 CPI x 100 inst = 4600 ns
- Ideal pipelined machine: 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
Ideal pipelined vs. single-cycle speedup: 4500 ns / 1040 ns = 4.33. What has not yet been considered?
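The three machine styles above differ only in cycle time, CPI, and pipeline fill/drain. A small C sketch (ours) reproducing the 100-instruction comparison:

```c
#include <stdio.h>

int main(void) {
    double n = 100.0;                     /* instructions executed */
    double single    = 45.0 * 1.0 * n;    /* 45 ns/cycle x CPI 1       */
    double multi     = 10.0 * 4.6 * n;    /* 10 ns/cycle x CPI 4.6     */
    double pipelined = 10.0 * (n + 4.0);  /* 1 CPI + 4-cycle drain     */
    printf("single:     %.0f ns\n", single);    /* 4500 */
    printf("multicycle: %.0f ns\n", multi);     /* 4600 */
    printf("pipelined:  %.0f ns\n", pipelined); /* 1040 */
    printf("speedup vs single: %.2f\n", single / pipelined); /* 4.33 */
    return 0;
}
```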
Compare Performance
Compare single-cycle, multicycle, and pipelined control using SPECint2000.
- Single-cycle: memory access = 200 ps, ALU = 100 ps, register file read and write = 50 ps. Cycle time = 200 + 50 + 100 + 200 + 50 = 600 ps.
- Multicycle: with 25% loads, 10% stores, 11% branches, 2% jumps, and 52% ALU operations, CPI = 4.12; the clock cycle = 200 ps (the longest functional unit).
- Pipelined: a load takes 1 clock cycle when there is no load-use dependence and 2 when there is, for an average of 1.5; stores and ALU operations take 1 clock cycle; branches take 1 when predicted correctly and 2 when not, for an average of 1.25; jumps take 2.
  CPI = 1.5 x 25% + 1 x 10% + 1 x 52% + 1.25 x 11% + 2 x 2% = 1.17
Average instruction time: single-cycle = 600 ps; multicycle = 4.12 x 200 = 824 ps; pipelined = 1.17 x 200 = 234 ps. The 200 ps memory access is the bottleneck. How can we improve it?

Can Pipelining Get Us into Trouble?
Yes: pipeline hazards.
- Structural hazards: an attempt to use the same resource in two different ways at the same time, e.g. two instructions trying to read the same memory at the same time.
- Data hazards: an attempt to use an item before it is ready, e.g. an instruction depends on the result of a prior instruction still in the pipeline:
  add r1, r2, r3
  sub r4, r2, r1
- Control hazards: an attempt to make a decision before the condition is evaluated, e.g. branch instructions:
  beq r1, loop
  add r1, r2, r3
We can always resolve hazards by waiting: the pipeline control must detect the hazard and take action (or delay action) to resolve it.

A Single Memory Is a Structural Hazard
[Pipeline diagram: a load followed by Instr 1-4; in one cycle the load's data access and a later instruction's fetch both need the single memory. A right-half highlight means read, a left-half highlight means write.]
Detection is easy in this case!

Structural Hazards Limit Performance
Example: if there are 1.3 memory accesses per instruction and only one memory access is possible per cycle, then the average CPI is at least 1.3; otherwise the resource would be more than 100% utilized.
- Solution 1: use separate instruction and data memories
- Solution 2: allow the memory to read and write more than one word per cycle
- Solution 3: stall

Control Hazard Solutions
Stall: wait until the decision is clear. It is possible to move the decision up to the 2nd stage by adding hardware to check the registers as they are read.
[Pipeline diagram: add, beq, load; the instruction after the branch waits for the branch outcome]
Impact: 2 clock cycles per branch instruction => slow.

Control Hazard Solutions
Predict: guess one direction, then back up if wrong.
[Pipeline diagram: predict not taken; add, beq, load proceed without a bubble]
Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right 50% of the time). A more dynamic scheme keeps a history per branch (right about 90% of the time).
Control Hazard Solutions
Redefine branch behavior so the branch takes effect after the next instruction: the "delayed branch".
[Pipeline diagram: add, beq, misc, load; the instruction in the delay slot executes regardless of the branch outcome]
Impact: 1 clock cycle per branch instruction if the compiler can find an instruction to put in the "slot" (possible about 50% of the time). Launching more instructions per clock cycle makes delay slots less useful.

Data Hazard
[Figure: a sequence of dependent instructions with overlapping pipeline stages]

Data Hazard - Forwarding
Use temporary results; don't wait for them to be written back:
- register file forwarding, to handle a read and a write of the same register in the same cycle
- ALU forwarding

Can't Always Forward
A load word can still cause a hazard: an instruction tries to read a register immediately following a load instruction that writes the same register. Thus, we need a hazard detection unit to "stall" the instruction that follows the load.

Stalling
We can stall the pipeline by keeping an instruction in the same stage.

Memory Hierarchy Design

5.1 Introduction

Memory Hierarchy Design
Motivated by the principle of locality - a 90/10 type of rule. Take advantage of two forms of locality:
- Spatial: nearby references are likely
- Temporal: the same reference is likely soon
Also motivated by cost/performance structures: smaller hardware is faster (SRAM, DRAM, disk, tape); access time and bandwidth vary across levels; fast memory is more expensive. Goal: provide a memory system with cost almost as low as the cheapest level and speed almost as fast as the fastest level.

Why Is Memory Relevant in Computer Design?
A computer's performance is given by the number of instructions executed per time unit. The time for executing an instruction depends on the ALU speed (i.e., the datapath cycle duration) and on the time it takes for each instruction to load/store its operands/result from/into memory (in brief, the time to access memory). The processing speed (CPU speed) grows faster than the memory speed; as a result, the CPU speed cannot be fully exploited. This speed gap leads to an unbalanced system!

Levels in a Typical Memory Hierarchy
[Figure]

Unit of Transfer / Addressable Unit
Unit of transfer: the number of bits read from, or written into, memory at a time. Internal: usually governed by the data bus width. External: usually a block of words, e.g. 512 or more. Addressable unit: the smallest location which can be uniquely addressed. Internal: word or byte. External: device-dependent, e.g. a disk "cluster".
Access Methods
- Sequential: data is stored in records, and access is in linear sequence (tape)
- Direct: data blocks have unique, direct access; data within a block is accessed sequentially (disk)
- Random: data has unique, direct access (RAM)
- Associative: data is retrieved based on a (partial) match rather than an address (cache)

5.2 Review of the ABCs of Caches

36 Basic Terms on Caches
cache, fully associative, write allocate, virtual memory, dirty bit, unified cache, memory stall cycles, block offset, misses per instruction, direct mapped, write back, block, valid bit, data cache, locality, block address, hit time, address trace, write through, cache miss, set, instruction cache, page fault, random placement, average memory access time, miss rate, index field, cache hit, n-way set associative, no-write allocate, page, least recently used, write buffer, miss penalty, tag field, write stall

Cache
The cache is the first level of the memory hierarchy encountered once the address leaves the CPU. Given the persistent mismatch between CPU and main-memory speeds, it exploits the principle of locality by providing a small, fast memory between the CPU and main memory. The term cache is now applied whenever buffering is employed to reuse commonly occurring items (e.g., file caches). Caching means copying information into a faster storage system; main memory can itself be viewed as a cache for secondary storage.

General Hierarchy Concepts
At each level the block concept is present (the block is the caching unit). Block size may vary depending on the level: bringing in a larger chunk amortizes the longer access, which works if the locality principle holds.
- Hit: an access where the block is present; the hit rate is its probability.
- Miss: an access where the block is absent (present only in lower levels); the miss rate likewise.
Mirroring and consistency: data residing in a higher level is a subset of the data in the lower level, and changes at the higher level must be reflected down - sometime; the policy for "sometime" is the consistency mechanism. Addressing: whatever the organization, you have to know how to get at it! Address checking and protection also live here.

Physical Address Structure
The key is that you want different block sizes at different levels.
[Figure]

Latency and Bandwidth
The time required for a cache miss depends on both the latency and the bandwidth of the memory (or lower level). Latency determines the time to retrieve the first word of the block; bandwidth determines the time to retrieve the rest of the block. A cache miss is handled by hardware and causes processors following in-order execution to pause or stall until the data are available.

Predicting Memory Access Times
On a hit: simply the access time of the cache. On a miss: access time + miss penalty, where miss penalty = access time of the lower level + block transfer time. The block transfer time depends on the block size (bigger blocks mean longer transfers) and on the bandwidth between the two levels of memory, which is usually dominated by the slower memory and the bus protocol. Performance:
  Average memory access time = hit time + miss rate x miss penalty
  Memory stall cycles = IC x memory references per instruction x miss rate x miss penalty
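The two performance formulas above are simple to script. A C sketch (ours; the 2% miss rate, 50-cycle penalty, and 1.33 references per instruction are illustrative values, which also reappear in the cache performance example later in the section):

```c
#include <stdio.h>

/* Average memory access time = hit time + miss rate x miss penalty */
static double amat(double hit, double miss_rate, double penalty) {
    return hit + miss_rate * penalty;
}

int main(void) {
    /* Illustrative numbers: 1-cycle hit, 2% miss rate, 50-cycle penalty */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.02, 50.0)); /* 2.00 */
    /* Memory stall cycles = IC x refs/inst x miss rate x penalty */
    double ic = 1e6, refs_per_inst = 1.33;
    printf("stall cycles = %.0f\n",
           ic * refs_per_inst * 0.02 * 50.0);               /* 1330000 */
    return 0;
}
```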
Four Standard Questions
- Block placement: where can a block be placed in the upper level?
- Block identification: how is a block found if it is in the upper level?
- Block replacement: which block should be replaced on a miss?
- Write strategy: what happens on a write?
We answer the four questions below for the first level of the memory hierarchy.

Block Placement Options
- Direct mapped: (block address) MOD (number of cache blocks)
- Fully associative: the block can be placed anywhere
- Set associative: a set is a group of n blocks, and each block position is called a way; the block is first mapped onto a set - (block address) MOD (number of cache sets) - and can be placed anywhere within that set
Most caches are direct mapped or 2- or 4-way set associative.

Block Placement Options (Cont.)
[Figure]

Block Identification
Each cache block carries tags.
- Address tags: which block am I? Many memory blocks may map to the same cache block. The physical address is now: address tag ## set index ## block offset. Note the relationship of block size, cache size, and tag size: the smaller the tag, the cheaper it is to compare.
- Status tags: what state is the block in? Valid, dirty, etc.
Physical address = r + m + n bits: r (address tag), m (set index), n (block offset), with 2^m addressable sets in the cache and 2^n bytes per block.

Block Identification (Cont.)
Caches have an address tag on each block frame that gives the block address, and a valid bit to say whether or not the entry contains a valid address. The block frame address can be divided into the tag field and the index field.
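Block placement and identification are just integer arithmetic on the address. A hedged C sketch (ours; the cache geometry is an assumed example with 2^m sets and 2^n-byte blocks):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Assumed geometry: 64 sets (m = 6), 32-byte blocks (n = 5) */
    const uint32_t m = 6, n = 5;
    uint32_t addr = 0x12345678;

    uint32_t offset = addr & ((1u << n) - 1);  /* block offset */
    uint32_t block  = addr >> n;               /* block address */
    uint32_t index  = block & ((1u << m) - 1); /* set index = block MOD #sets */
    uint32_t tag    = block >> m;              /* address tag */

    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```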
Block Replacement
- Random: just pick one and chuck it - a simple hash game played on the target block frame address. Some designs use truly random selection, but the lack of reproducibility is a problem at debug time.
- LRU (least recently used): keep track of the time since each block was last accessed. This is expensive if the number of blocks is large, due to the global compare; hence an approximation (e.g., a use bit per tag) or LFU is often used.
- FIFO.
For direct-mapped placement there is only one choice, so no replacement policy is needed.

Short Summaries from the Previous Figure
More-way associativity is better for a small cache; 2- or 4-way associative caches perform similarly to 8-way for larger caches. A larger cache size is better. LRU is the best for small block sizes; random works fine for large caches; FIFO outperforms random in smaller caches; there is little difference between LRU and random for larger caches.

Improving Cache Performance
The MIPS mix is 10% stores and 37% loads. Writes are about 10%/(100%+10%+37%) = 7% of overall memory traffic, and 10%/(10%+37%) = 21% of data cache traffic. Make the common case fast - which implies optimizing caches for reads.
- Read optimizations: the block can be read concurrently with the tag comparison; on a hit the read information is passed on, and on a miss the block is nuked and the miss access starts.
- Write optimizations: the block can't be modified until after the tag check, hence writes take longer.

Write Options
- Write through: the write is posted to the cache line and through to the next lower level. It incurs a write stall (an intermediate write buffer can reduce the stall).
- Write back: write only to the cache, not to the lower level. This implies that the cache and main memory can be inconsistent, so mark the line with a dirty bit; if the block is replaced while dirty, write it back.
Pros and cons - both are useful:
- Write through: no write on a read miss, simpler to implement, no inconsistency with main memory.
- Write back: uses less main-memory bandwidth, write times are independent of main-memory speed, and multiple writes within a block require only one write to main memory.

5.3 Cache Performance

Cache Performance
[Figure: cache performance equations]

Cache Performance Example
Each instruction takes 2 clock cycles (ignoring memory stalls); the cache miss penalty is 50 clock cycles; the miss rate is 2%; there are on average 1.33 memory references per instruction.
- Ideal: IC x 2 x cycle time
- With cache: IC x (2 + 1.33 x 2% x 50) x cycle time = IC x 3.33 x cycle time
- No cache: IC x (2 + 1.33 x 100% x 50) x cycle time
The importance of the cache is greater for CPUs with lower CPI and higher clock rates - Amdahl's Law.

Average Memory Access Time vs. CPU Time
Compare two cache organizations: miss rate - direct mapped 1.4%, 2-way set associative 1.0%; clock cycle time - direct mapped 2.0 ns, 2-way 2.2 ns. CPI with a perfect cache is 2.0; there are 1.3 memory references per instruction; the miss penalty is 70 ns; the hit time is 1 clock cycle.
- Average memory access time = hit time + miss rate x miss penalty
  AMAT(direct) = 1 x 2.0 + (1.4% x 70) = 2.98 ns
  AMAT(2-way) = 1 x 2.2 + (1.0% x 70) = 2.90 ns
- CPU time:
  CPU(direct) = IC x (2 x 2.0 + 1.3 x 1.4% x 70) = 5.27 x IC
  CPU(2-way) = IC x (2 x 2.2 + 1.3 x 1.0% x 70) = 5.31 x IC
Since CPU time is our bottom-line evaluation, and since direct mapped is simpler to build, the preferred cache is direct mapped in this example.

Unified and Split Caches
Unified: a 32KB cache; split: a 16KB instruction cache and a 16KB data cache. The hit time is 1 clock cycle and the miss penalty is 100 clock cycles; a load/store hit takes 1 extra clock cycle on the unified cache, since instructions and data contend for its single port. With 36% loads/stores, references to the cache are 74% instruction and 26% data.
- Miss rate (16KB instruction) = 3.82/1000 / 1.0 = 0.004; miss rate (16KB data) = 40.9/1000 / 0.36 = 0.114
- Miss rate for the split cache: (74% x 0.004) + (26% x 0.114) = 0.0324; miss rate for the unified cache: 43.3/1000 / (1 + 0.36) = 0.0318
- Average memory access time = %inst x (hit time + inst miss rate x miss penalty) + %data x (hit time + data miss rate x miss penalty)
  AMAT(split) = 74% x (1 + 0.004 x 100) + 26% x (1 + 0.114 x 100) = 4.24
  AMAT(unified) = 74% x (1 + 0.0318 x 100) + 26% x (1 + 1 + 0.0318 x 100) = 4.44

Improving Cache Performance
Average memory access time = hit time + miss rate x miss penalty. Strategies for improving cache performance:
- reducing the miss penalty
- reducing the miss rate
- reducing the miss penalty or miss rate via parallelism
- reducing the time to hit in the cache

5.4 Reducing Cache Miss Penalty

Techniques for Reducing Miss Penalty
- Multilevel caches (the most important)
- Critical word first and early restart
- Giving priority to read misses over writes
- Merging write buffers
- Victim caches

Multi-Level Caches
Probably the best miss-penalty reduction. Performance measurement for 2-level caches:
  AMAT = hit time L1 + miss rate L1 x miss penalty L1
  miss penalty L1 = hit time L2 + miss rate L2 x miss penalty L2
  AMAT = hit time L1 + miss rate L1 x (hit time L2 + miss rate L2 x miss penalty L2)
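The two-level composition above nests naturally in code. A C sketch (ours; the L1/L2 numbers are assumed for illustration, with the L2 miss rate taken as a local miss rate):

```c
#include <stdio.h>

static double amat2(double hit1, double mr1,
                    double hit2, double mr2, double penalty2) {
    /* The miss penalty of L1 is itself an AMAT through L2 */
    double penalty1 = hit2 + mr2 * penalty2;
    return hit1 + mr1 * penalty1;
}

int main(void) {
    /* Assumed: 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit,
       20% L2 local miss rate, 100-cycle main-memory access */
    printf("AMAT = %.2f cycles\n", amat2(1.0, 0.04, 10.0, 0.20, 100.0));
    /* = 1 + 0.04 x (10 + 0.20 x 100) = 2.20 cycles */
    return 0;
}
```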
Critical Word First and Early Restart
Do not wait for the full block to be loaded before restarting the CPU.
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in. Also called wrapped fetch and requested word first.
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
The benefits of critical word first and early restart depend on the block size (generally useful only with large blocks) and on the likelihood of another access to the portion of the block that has not yet been fetched. Spatial locality works against this: programs tend to want the next sequential word, so it is not clear the benefit materializes.

Victim Caches
Remember what was just discarded in case it is needed again. Add a small fully associative cache (called a victim cache) between the cache and the refill path:
- It contains only blocks discarded from the cache because of a miss.
- It is checked on a miss to see if it has the desired data before going to the next lower level of memory; if so, the victim block and the cache block are swapped.
- The victim cache and the regular cache are addressed at the same time, so the penalty does not increase.
Jouppi (DEC SRC) showed miss reductions of 20-95% for a 4KB direct-mapped cache with 1-5 victim blocks.

Victim Cache Organization
[Figure]

5.5 Reducing Miss Rate

Classifying Cache Misses - the 3 C's
- Compulsory (independent of cache size): the first access to a block; there is no choice but to load it. Also called cold-start or first-reference misses.
- Capacity (decreases as cache size increases): the cache cannot contain all the blocks needed during execution, so blocks are discarded and later retrieved.
- Conflict (collision; decreases as associativity increases): a side effect of set-associative or direct mapping - a block may be discarded and later retrieved if too many blocks map to the same set.

Techniques for Reducing Miss Rate
- Larger block size
- Larger caches
- Higher associativity
- Way-prediction caches
- Compiler optimizations

Larger Block Sizes
Obvious advantage: fewer compulsory misses, thanks to spatial locality. Obvious disadvantages: a higher miss penalty (a larger block takes longer to move), and possibly more conflict and capacity misses if the cache is small. Don't let the increase in miss penalty outweigh the decrease in miss rate.

Large Caches
Help with both conflict and capacity misses, but may need a longer hit time and/or higher hardware cost. Popular in off-chip caches.

Higher Associativity
An 8-way set-associative cache is, for practical purposes, as effective at reducing misses as a fully associative one. The 2:1 rule of thumb: a 2-way set-associative cache of size N/2 misses about as often as a direct-mapped cache of size N (this held for cache sizes under 128 KB). Greater associativity comes at the cost of increased hit time and can lengthen the clock cycle; Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache and +2% for an internal one.
Way Prediction
Extra bits are kept in the cache to predict the way, or block within the set, of the next cache access. The multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle; a miss results in checking the other blocks for matches in subsequent clock cycles. The Alpha 21264 uses way prediction in its 2-way set-associative instruction cache; simulation using SPEC95 suggested way-prediction accuracy in excess of 85%.

Compiler Optimization for Data
Idea: improve the spatial and temporal locality of the data. There are lots of options (the first is shown below; a sketch of loop interchange follows this slide):
- Array merging: allocate arrays so that paired operands show up in the same cache block
- Loop interchange: exchange the inner and outer loop order to improve cache performance
- Loop fusion: for independent loops accessing the same data, fuse the loops into a single aggregate loop
- Blocking: do as much as possible on a sub-block before moving on

Merging Arrays Example

```c
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
```

This reduces conflicts between val and key and improves spatial locality.
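Loop interchange, mentioned in the list above, is just as mechanical. A minimal C sketch (ours), relying on C's row-major array layout:

```c
#include <stdio.h>

#define N 1024
double x[N][N];

/* Before: column-order traversal of a row-major array - each inner-loop
   access strides N doubles, touching a new cache block almost every time. */
void scale_bad(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            x[i][j] = 2.0 * x[i][j];
}

/* After interchange: the inner loop walks consecutive memory, so each
   cache block is fully used before moving on - better spatial locality. */
void scale_good(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 2.0 * x[i][j];
}

int main(void) {
    x[3][5] = 1.0;
    scale_bad();
    scale_good();
    printf("%.1f\n", x[3][5]); /* 1.0 scaled twice -> 4.0 */
    return 0;
}
```

Both functions compute the same result; only the memory access order, and hence the miss rate, differs.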
5.7 Reducing Hit Time

Reducing Hit Time
Hit time is critical because it affects the clock cycle time: on many machines, cache access time limits the clock cycle rate. A fast hit time is multiplied in importance beyond the average memory access time formula because it helps everything:
  Average memory access time = hit time + miss rate x miss penalty
and the miss penalty is clock-cycle dependent.

Techniques for Reducing Hit Time
- Small and simple caches
- Avoiding address translation during indexing of the cache
- Pipelined cache access
- Trace caches

Cache Optimization Summary
[Figure: table of optimizations and their effects on miss rate, miss penalty, and hit time]

5.9 Main Memory

Main Memory - 3 Important Issues
- Capacity.
- Latency. Access time: the time between when a read is requested and when the word arrives. Cycle time: the minimum time between requests to memory (greater than the access time, since memory needs the address lines to be stable between accesses). Latency is amortized by addressing big chunks, like an entire cache block, and is critical to cache performance when the miss goes to main memory.
- Bandwidth: the number of bytes read or written per unit time, which affects the time it takes to transfer a block.

3 Examples of Bus Width, Memory Width, and Memory Interleaving to Achieve Memory Bandwidth
[Figure]

Wider Main Memory
Doubling or quadrupling the width of the cache or memory doubles or quadruples the memory bandwidth, and the miss penalty is reduced correspondingly. Costs and drawbacks:
- More cost on the memory bus.
- The multiplexer between the cache and the CPU may be on the critical path (the CPU still accesses the cache one word at a time); multiplexors can instead be put between L1 and L2.
- The design of error correction becomes more complicated: if only a portion of the block is updated, all other portions must be read to calculate the new error-correction code.
- Since main memory is traditionally expandable by the customer, the minimum increment is doubled or quadrupled.

5.10 Virtual Memory

Virtual Memory
Virtual memory divides physical memory into blocks (called pages or segments) and allocates them to different processes. With virtual memory, the CPU produces virtual addresses that are translated by a combination of hardware and software to physical addresses, which access main memory. The process is called memory mapping or address translation. Today, the two memory-hierarchy levels controlled by virtual memory are DRAM and magnetic disk.

Example of Virtual to Physical Address Mapping
[Figure: mapping by a page table]

Address Translation Hardware for Paging
For an l-bit address with 2^n-byte pages, the page number (the high l - n bits) is translated to a frame number f, while the page offset d (the low n bits) passes through unchanged.

Cache vs. VM Differences
- Replacement: a cache miss is handled by hardware; a page fault is usually handled by the OS.
- Addresses: the VM space is determined by the address size of the CPU; the cache size is independent of the CPU address size.
- Lower-level memory: for caches, main memory is not shared by something else; for VM, most of the disk contains the file system, which is addressed differently (usually in I/O space) - the lower level of VM is usually called swap space.

2 VM Styles - Paged or Segmented?
Virtual memory systems can be categorized into two classes: pages (fixed-size blocks) and segments (variable-size blocks).
- Words per address: page - one; segment - two (segment and offset)
- Programmer visible? page - invisible to the application programmer; segment - may be visible
- Replacing a block: page - trivial (all blocks are the same size); segment - hard (must find a contiguous, variable-size, unused portion of main memory)
- Memory use inefficiency: page - internal fragmentation (unused portion of a page); segment - external fragmentation (unused pieces of main memory)
- Efficient disk traffic: page - yes (adjust the page size to balance access time and transfer time); segment - not always (small segments may transfer just a few bytes)

Virtual Memory - The Same 4 Questions
- Block placement: the choice is between lower miss rates with complex placement, or vice versa. Since the miss penalty is huge, choose the low miss rate: place a page anywhere, similar to a fully associative cache.
- Block identification: both styles use an additional data structure - a page table for fixed-size pages, a segment table for variable-size segments. The page number indexes the table to obtain the frame number f; the page offset d passes through unchanged.
Virtual Memory - The Same 4 Questions (Cont.)
- Block replacement: LRU is the best. However, true LRU is a bit complex, so an approximation is used: the page table contains a use tag which is set on access; the OS checks the tags every so often, records what it sees in a data structure, then clears them all; on a miss, the OS decides which page has been used least and replaces it.
- Write strategy: always write back. Given the access time of the disk, write through is silly; use a dirty bit so that only pages that have been modified are written back.

Techniques for Fast Address Translation
The page table is kept in main memory (kernel memory), and each process has its own page table. Every data or instruction access therefore requires two memory accesses: one for the page table and one for the data or instruction. This can be solved by a special fast-lookup hardware cache called associative registers or a translation look-aside buffer (TLB): if locality applies, cache the recent translations. A TLB entry holds: virtual page number, physical page number, protection bit, use bit, dirty bit.

TLB = Translation Look-aside Buffer
The TLB must be on chip; otherwise it is worthless. It is fully associative (parallel search). Typical TLBs: hit time of 1 cycle; miss penalty of 10 to 30 cycles; miss rate of 0.1% to 2%; TLB size of 32 B to 8 KB.

Paging Hardware with TLB
[Figure]

TLB of the Alpha 21264
An address space number (a process ID) is kept in each entry so the TLB need not be flushed on a context switch. There are a total of 128 TLB entries.

Page Size - An Architectural Choice
Large pages are good: they reduce the page table size, amortize the long disk access, improve the hit rate if spatial locality is good, and reduce the number of TLB misses. Large pages are bad: more internal fragmentation (if everything is random, each structure's last page is only half full), and process start-up time takes longer.
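To close the loop on address translation, here is a hedged C sketch (ours; a toy single-level page table with 4 KB pages, not any particular machine's format):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12                   /* 4 KB pages: offset = low 12 bits */
#define NUM_PAGES 16                   /* toy virtual address space */

static uint32_t page_table[NUM_PAGES]; /* virtual page number -> frame number */

static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;               /* virtual page number */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);  /* passes through */
    uint32_t frame  = page_table[vpn];                  /* one memory access;
                                                           a TLB caches this */
    return (frame << PAGE_BITS) | offset;
}

int main(void) {
    page_table[3] = 7;  /* map virtual page 3 -> physical frame 7 */
    printf("0x%x -> 0x%x\n", 0x3ABCu, translate(0x3ABCu)); /* 0x3abc -> 0x7abc */
    return 0;
}
```

A real MMU adds valid/protection bits and a multi-level table, but the split-translate-recombine pattern is exactly the one described above.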