OMSE 510: Computing Foundations 3: Caches, Assembly, CPU Overview
Chris Gilmore <grimjack@cs.pdx.edu>
Portland State University / OMSE

Today: Caches, DLX Assembly, CPU Overview

Computer System (Idealized)
CPU, Memory, and Disk (via a Disk Controller), all connected by the System Bus.

The Big Picture: Where Are We Now?
The five classic components of a computer: Processor (Control + Datapath), Memory, Input, Output.
Next topic: simple caching techniques, and the many ways to improve cache performance.

Recap: Levels of the Memory Hierarchy
From the processor downward: registers (instructions and operands), cache (blocks), memory (pages), disk (files), tape.
Upper levels are faster; lower levels are larger.

Recap: Exploit Locality to Achieve Fast Memory
Two different types of locality:
- Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
- Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
By taking advantage of the principle of locality we can:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system.
SRAM is fast but expensive and not very dense: a good choice for providing the user with FAST access time.

Memory Hierarchy: Terminology
- Hit: the data appears in some block in the upper level (example: block X).
- Hit rate: the fraction of memory accesses found in the upper level.
- Hit time: time to access the upper level, which consists of the RAM access time plus the time to determine hit/miss.
- Miss: the data must be retrieved from a block in the lower level (block Y).
- Miss rate = 1 - (hit rate).
- Miss penalty: time to replace a block in the upper level plus the time to deliver the block to the processor.
- Hit time << miss penalty.
(Figure: block X exchanged between the upper-level memory and the processor; block Y fetched from the lower-level memory on a miss.)

The Art of Memory System Design
Workload or benchmark programs generate a processor reference stream <op,addr>, <op,addr>, <op,addr>, ... where op is i-fetch, read, or write.
Optimize the memory system organization (cache + main memory) to minimize the average memory access time for typical workloads.
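These definitions combine into a single figure of merit, the average memory access time (AMAT), which later slides use repeatedly. Below is a minimal C sketch of that calculation; the cycle counts are illustrative placeholders, not measurements from the course.

    #include <stdio.h>

    int main(void) {
        double hit_time     = 1.0;   /* cycles to access the upper level (assumed)     */
        double miss_rate    = 0.05;  /* fraction of accesses that miss (assumed)       */
        double miss_penalty = 50.0;  /* cycles to fetch from the lower level (assumed) */

        /* AMAT = Hit Time + Miss Rate x Miss Penalty */
        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05 * 50 = 3.50 */
        return 0;
    }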
Example: Fully Associative Cache
- No cache index: compare the cache tags of all cache entries in parallel.
- Example: with 32-byte blocks, we need N 27-bit comparators (one per entry).
- By definition, conflict misses = 0 for a fully associative cache.
(Figure: address split into a 27-bit cache tag (bits 31-5) and a byte select (bits 4-0, e.g. 0x01); every entry holds a valid bit, a cache tag, and 32 bytes of data.)

Example: 1 KB Direct Mapped Cache with 32 B Blocks
For a 2^N-byte cache:
- The uppermost (32 - N) bits are always the cache tag.
- The lowest M bits are the byte select (block size = 2^M).
(Figure: block address split into cache tag (bits 31-10, example 0x50), cache index (bits 9-5, example 0x01), and byte select (bits 4-0, example 0x00); the tag and valid bit are stored as part of the cache "state" alongside each 32-byte data block.)
(A small C sketch of this tag/index/byte-select split appears after the summary of cache-miss sources below.)

Set Associative Cache
- N-way set associative: N entries for each cache index, i.e. N direct-mapped caches operating in parallel.
- Example: two-way set associative cache. The cache index selects a "set" from the cache; the two tags in the set are compared to the input address in parallel; data is selected based on the tag comparison result.
(Figure: two valid/tag/data arrays addressed by the same cache index; two comparators feed an OR gate producing Hit, and a mux selects the matching cache block.)

Disadvantage of Set Associative Cache
N-way set associative cache versus direct mapped cache:
- N comparators vs. 1.
- Extra MUX delay for the data.
- Data comes AFTER the hit/miss decision and set selection.
In a direct mapped cache, the cache block is available BEFORE hit/miss is known: it is possible to assume a hit and continue, and recover later on a miss.

Block Size Tradeoff
Larger block sizes take advantage of spatial locality, BUT:
- A larger block size means a larger miss penalty: it takes longer to fill up the block.
- If the block size is too big relative to the cache size, the miss rate will go up: too few cache blocks.
In general, Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate.
(Figure: as block size grows, miss penalty rises steadily; miss rate first falls (exploiting spatial locality) and then rises again (fewer blocks compromises temporal locality); average access time therefore has a minimum, beyond which both miss penalty and miss rate increase.)

A Summary on Sources of Cache Misses
- Compulsory (cold start, process migration, first reference): the first access to a block. A "cold" fact of life: not a whole lot you can do about it. Note: if you are going to run billions of instructions, compulsory misses are insignificant.
- Conflict (collision): multiple memory locations mapped to the same cache location. Solution 1: increase cache size. Solution 2: increase associativity.
- Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase cache size.
- Coherence (invalidation): another process (e.g., I/O) updates memory.
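As promised above, here is a small C sketch (mine, not from the slides) of the address split used by the 1 KB direct-mapped cache with 32-byte blocks: 5 byte-select bits, 5 index bits, and a 22-bit tag.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr = 0x00001234;                  /* arbitrary example address  */

        uint32_t byte_select = addr & 0x1F;          /* bits 4..0  (32-byte block) */
        uint32_t index       = (addr >> 5) & 0x1F;   /* bits 9..5  (32 blocks)     */
        uint32_t tag         = addr >> 10;           /* bits 31..10 (22-bit tag)   */

        printf("tag = 0x%x, index = %u, byte select = %u\n",
               tag, index, byte_select);
        return 0;
    }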
Sources of Cache Misses Quiz
Assume constant cost. For a direct-mapped, an N-way set-associative, and a fully associative cache, rate the cache size (small, medium, big?) and each miss type: compulsory, conflict, capacity, coherence.
Choices: Zero, Low, Medium, High, Same.

Sources of Cache Misses: Answer
                  Direct Mapped   N-way Set Associative   Fully Associative
Cache Size        Big             Medium                  Small
Compulsory Miss   Same            Same                    Same
Conflict Miss     High            Medium                  Zero
Capacity Miss     Low             Medium                  High
Coherence Miss    Same            Same                    Same
Note: if you are going to run billions of instructions, compulsory misses are insignificant.

Recap: Four Questions for Caches and the Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

Q1: Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache, for fully associative, direct mapped, and 2-way set associative organizations (set-associative mapping = block number modulo number of sets):
- Fully associative: block 12 can go anywhere (blocks 0-7).
- Direct mapped: block 12 can go only into block 4 (12 mod 8).
- 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4); the 8 blocks form sets 0-3.

Q2: How is a block found if it is in the upper level?
The block address is divided into a tag and an index, followed by the block offset. The index does the set select; the tag comparison does the data select.
Lookup uses direct indexing (index plus block offset), tag compares, or a combination.
Increasing associativity shrinks the index and expands the tag.

Q3: Which block should be replaced on a miss?
Easy for direct mapped. For set associative or fully associative: Random, FIFO, LRU (least recently used), LFU (least frequently used).
Miss rates, LRU vs. Random, by associativity and cache size:
            2-way LRU   2-way Random   4-way LRU   4-way Random   8-way LRU   8-way Random
16 KB       5.2%        5.7%           4.7%        5.3%           4.4%        5.0%
64 KB       1.9%        2.0%           1.5%        1.7%           1.4%        1.5%
256 KB      1.15%       1.17%          1.13%       1.13%          1.12%       1.12%

Q4: What happens on a write?
- Write through: the information is written to both the block in the cache and the block in the lower-level memory.
- Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced. (Is the block clean or dirty?)
Pros and cons of each?
- WT: read misses cannot result in writes; coherency is easier.
- WB: repeated writes to the same block do not all go to memory.
WT is always combined with write buffers so that stores don't wait for the lower-level memory.

Write Buffer for Write Through
Processor writes go to the Cache and to a Write Buffer that sits between the cache and DRAM.
- A write buffer is needed between the cache and memory: the processor writes data into the cache and the write buffer; the memory controller writes the contents of the buffer to memory.
- The write buffer is just a FIFO; typical number of entries: 4.
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.
- Memory system designer's nightmare: store frequency (w.r.t. time) > 1 / DRAM write cycle, i.e. write buffer saturation.

Write Buffer Saturation
If store frequency (w.r.t. time) > 1 / DRAM write cycle for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it (the CPU cycle time is <= the DRAM write cycle time).
Solutions for write buffer saturation:
- Use a write-back cache.
- Install a second-level (L2) cache between the write buffer and DRAM. (Does this always work?)
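A minimal sketch, assuming nothing beyond the description above: the write buffer is a 4-entry FIFO that the processor fills and the memory controller drains, and a store only stalls the CPU when the FIFO is full. This is my illustration, not code from the course.

    #include <stdio.h>
    #include <stdint.h>

    #define WB_ENTRIES 4                    /* typical depth from the slide */

    struct wb_entry { uint32_t addr, data; };
    static struct wb_entry buf[WB_ENTRIES];
    static int head, tail, count;

    static int wb_push(uint32_t addr, uint32_t data)   /* processor side */
    {
        if (count == WB_ENTRIES) return 0;  /* buffer full: CPU must stall */
        buf[tail].addr = addr;
        buf[tail].data = data;
        tail = (tail + 1) % WB_ENTRIES;
        count++;
        return 1;
    }

    static void wb_drain_one(void)          /* memory controller side */
    {
        if (count == 0) return;
        printf("mem[0x%x] <- 0x%x\n", buf[head].addr, buf[head].data);
        head = (head + 1) % WB_ENTRIES;
        count--;
    }

    int main(void) {
        if (!wb_push(0x18, 200))
            printf("write buffer saturated: stall\n");
        wb_drain_one();
        return 0;
    }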
Write-miss Policy: Write Allocate versus Not Allocate
Assume a 16-bit write to memory location 0x0 causes a miss. Do we read in the block?
- Yes: write allocate.
- No: write not allocate.
(Figure: the same 1 KB direct-mapped cache layout as before, with cache tag, cache index, and byte select all 0x00 for this address, showing the block that would be allocated.)

Impact on Cycle Time
Cache hit time is directly tied to the clock rate; it increases with cache size and with associativity.
(Figure: a simple pipeline, PC, I-cache, IR, execute registers, D-cache, writeback, showing where I-cache and D-cache misses stall it.)
Average memory access time = Hit Time + Miss Rate x Miss Penalty
Time = IC x CT x (ideal CPI + memory stalls)

What happens on a cache miss?
For an in-order pipeline, 2 options:
1. Freeze the pipeline in the Mem stage (popular early on: Sparc, R4000). The IF/ID/EX stages stall until the miss is serviced and the Mem/Wr stages can complete.
2. Use Full/Empty bits on registers plus an MSHR queue.
   - MSHR = "Miss Status/Handler Registers" (Kroft). Each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line.
   - Per cache line: keep info about the memory address. For each word: the register (if any) that is waiting for the result. Used to "merge" multiple requests to one memory line.
   - A new load creates an MSHR entry and sets the destination register to "Empty". The load is "released" from the pipeline.
   - An attempt to use the register before the result returns causes the instruction to block in the decode stage.
   - Limited "out-of-order" execution with respect to loads. Popular with in-order superscalar architectures.
   - Out-of-order pipelines already have this functionality built in (load queues, etc.).

Improving Cache Performance: 3 General Options
Time = IC x CT x (ideal CPI + memory stalls)
Average memory access time = Hit Time + (Miss Rate x Miss Penalty) = (Hit Rate x Hit Time) + (Miss Rate x Miss Time)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

3Cs Absolute Miss Rate (SPEC92)
(Figure: miss rate vs. cache size, 1 KB to 128 KB, for 1-, 2-, 4-, and 8-way associativity, decomposed into conflict, capacity, and compulsory components; the compulsory component is vanishingly small.)

2:1 Cache Rule
Miss rate of a 1-way (direct-mapped) cache of size X = miss rate of a 2-way set-associative cache of size X/2.
(Figure: the same miss-rate-versus-size decomposition, illustrating the rule.)

3Cs Relative Miss Rate
(Figure: the same data normalized to 100%, showing the relative share of conflict, capacity, and compulsory misses for each cache size and associativity.)
Flaws: assumes a fixed block size. Good: the insight leads to invention.

1. Reduce Misses via Larger Block Size
(Figure: miss rate vs. block size, 16 to 256 bytes, for 1K, 4K, 16K, 64K, and 256K caches; miss rate falls with block size and then rises again for the small caches.)

2. Reduce Misses via Higher Associativity
2:1 cache rule: miss rate of a direct-mapped cache of size N = miss rate of a 2-way cache of size N/2.
Beware: execution time is the only final measure! Will the clock cycle time increase? Hill [1988] suggested the hit time for 2-way vs. 1-way is about +10% for an external cache, +2% for an internal one.

Example: Avg. Memory Access Time vs. Miss Rate
Assume CCT = 1.10 for 2-way, 1.12 for 4-way, 1.14 for 8-way vs. the direct-mapped clock cycle time.
Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
4                 1.72    1.67    1.61    1.53
8                 1.46    1.48    1.47    1.43
16                1.29    1.32    1.32    1.32
32                1.20    1.24    1.25    1.27
64                1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20
(Entries shown in green on the original slide mark cases where A.M.A.T. is not improved by more associativity. AMAT = average memory access time.)
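To show how a row of such a comparison is produced, here is a C sketch for one cache size; the miss rates and the miss penalty below are invented for illustration and are not the data behind the numbers above.

    #include <stdio.h>

    int main(void) {
        double penalty = 50.0;     /* miss penalty in cycles (assumed)      */
        double mr_1way = 0.040;    /* miss rate, direct mapped (assumed)    */
        double mr_2way = 0.032;    /* miss rate, 2-way set assoc. (assumed) */

        /* Hit time is 1 clock; the 2-way clock cycle is 10% longer (Hill [1988]). */
        double amat_1way = 1.00 + mr_1way * penalty;
        double amat_2way = 1.10 + mr_2way * penalty;

        printf("1-way AMAT = %.2f, 2-way AMAT = %.2f (in 1-way clock cycles)\n",
               amat_1way, amat_2way);
        return 0;
    }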
3. Reducing Misses via a "Victim Cache"
How to combine the fast hit time of direct mapped yet still avoid conflict misses? Add a small buffer that holds data recently discarded from the cache.
Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache. Used in Alpha and HP machines.
(Figure: a direct-mapped TAGS/DATA array backed by a small fully associative victim cache, four entries, each a tag-and-comparator plus one cache line of data, sitting between the cache and the next lower level in the hierarchy.)

4. Reducing Misses by Hardware Prefetching
E.g., instruction prefetching: the Alpha 21064 fetches 2 blocks on a miss; the extra block is placed in a "stream buffer"; on a miss, check the stream buffer.
Works with data blocks too:
- Jouppi [1990]: 1 data stream buffer caught 25% of the misses from a 4 KB cache; 4 streams caught 43%.
- Palacharla & Kessler [1994], for scientific programs: 8 streams caught 50% to 70% of the misses from two 64 KB, 4-way set-associative caches.
Prefetching relies on having extra memory bandwidth that can be used without penalty.

5. Reducing Misses by Software Prefetching Data
- Data prefetch: load data into a register (HP PA-RISC loads).
- Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9).
- Special prefetching instructions cannot cause faults; this is a form of speculative execution.
Issuing prefetch instructions takes time: is the cost of the prefetch issues < the savings in reduced misses? Wider superscalar issue reduces the difficulty of finding the issue bandwidth.

6. Reducing Misses by Compiler Optimizations
McFarling [1989] reduced cache misses by 75% on an 8 KB direct-mapped cache with 4-byte blocks, in software.
Instructions:
- Reorder procedures in memory so as to reduce conflict misses.
- Profiling to look at conflicts (using tools they developed).
Data:
- Merging arrays: improve spatial locality with a single array of compound elements vs. 2 separate arrays.
- Loop interchange: change the nesting of loops to access data in the order it is stored in memory (a short C sketch appears just before the early-restart discussion below).
- Loop fusion: combine 2 independent loops that have the same looping and some variables in common.
- Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows.

Improving Cache Performance (Continued)
1. Reduce the miss rate, 2. reduce the miss penalty, or 3. reduce the time to hit in the cache.

0. Reducing Penalty: Faster DRAM / Interface
- New DRAM technologies: RAMBUS (same initial latency, but much higher bandwidth), synchronous DRAM.
- Better bus interfaces.
- CRAY technique: only use SRAM.

1. Reducing Penalty: Read Priority over Write on Miss
The write buffer between the cache and DRAM (described earlier: the processor writes into the cache and the buffer, the memory controller drains the FIFO, typically 4 entries) also allows reads to bypass writes, but this raises correctness issues.
Write-buffer issues:
- Write through with write buffers can create RAW conflicts between buffered writes and main-memory reads on cache misses.
- If we simply wait for the write buffer to empty, the read miss penalty can increase (by 50% on the old MIPS 1000).
- Better: check the write buffer contents before the read; if there are no conflicts, let the memory access continue.
Write back?
- A read miss may replace a dirty block. Normal: write the dirty block to memory, and then do the read.
- Instead: copy the dirty block to a write buffer, then do the read, and then do the write. The CPU stalls less since it restarts as soon as the read is done.
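Here is the promised loop-interchange sketch, my own illustration of the compiler optimization listed above: C stores x[][] row-major, so the interchanged loop walks memory sequentially and gets far more use out of each cache block.

    #define N 1024
    static double x[N][N];

    void col_order_update(void)       /* before: poor spatial locality */
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                x[i][j] += 1.0;       /* stride of N doubles between accesses */
    }

    void row_order_update(void)       /* after loop interchange */
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] += 1.0;       /* stride of 1 double: sequential */
    }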
2. Reduce Penalty: Early Restart and Critical Word First
Don't wait for the full block to be loaded before restarting the CPU.
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.
Generally useful only with large blocks. Spatial locality is a problem: the program tends to want the next sequential word anyway, so it is not clear how much early restart benefits.

3. Reduce Penalty: Non-blocking Caches
A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss.
- Requires Full/Empty bits on registers or out-of-order execution.
- Requires multi-bank memories.
"Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests. "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
- Significantly increases the complexity of the cache controller, since there can be multiple outstanding memory accesses.
- Requires multiple memory banks (otherwise it cannot be supported).
- The Pentium Pro allows 4 outstanding memory misses.

Value of Hit Under Miss for SPEC
(Figure: average memory access time for the SPEC92 integer and floating-point benchmarks under "hit under n misses" for n = 0, 1, 2, 64, relative to a blocking cache.)
- FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26.
- Integer programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19.
- 8 KB data cache, direct mapped, 32 B blocks, 16-cycle miss.

4. Reduce Penalty: Second-Level Cache
L2 equations (a small C sketch of these equations appears just before Summary #1 below):
AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 x (Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2)
Definitions:
- Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2).
- Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 x Miss Rate_L2).
The global miss rate is what matters.

Reducing Misses: which techniques apply to an L2 cache?
1. Reduce misses via larger block size.
2. Reduce conflict misses via higher associativity.
3. Reduce conflict misses via a victim cache.
4. Reduce misses by HW prefetching of instructions and data.
5. Reduce misses by SW prefetching of data.
6. Reduce capacity/conflict misses by compiler optimizations.

L2 Cache Block Size and A.M.A.T.
(Figure: relative CPU time vs. L2 block size for 16- to 512-byte blocks; values range from 1.95 down to a minimum of 1.27 and back up to 1.34. Configuration: 32 KB L1, 8-byte path to memory.)

Improving Cache Performance (Continued)
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache:
   - lower associativity (+ victim caching)?
   - 2nd-level cache
   - careful virtual memory design
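As flagged above, a small C sketch of the two-level AMAT equations; the latencies and miss rates are assumed values, purely for illustration.

    #include <stdio.h>

    int main(void) {
        double hit_l1 = 1.0,  miss_rate_l1 = 0.05;   /* local L1 miss rate (assumed)  */
        double hit_l2 = 10.0, miss_rate_l2 = 0.20;   /* local L2 miss rate (assumed)  */
        double mem_penalty = 100.0;                  /* main-memory penalty (assumed) */

        double miss_penalty_l1 = hit_l2 + miss_rate_l2 * mem_penalty;
        double amat = hit_l1 + miss_rate_l1 * miss_penalty_l1;

        /* Global L2 miss rate = Miss Rate_L1 x Miss Rate_L2 */
        double global_l2 = miss_rate_l1 * miss_rate_l2;

        printf("AMAT = %.2f cycles, global L2 miss rate = %.3f\n",
               amat, global_l2);      /* 1 + 0.05 * (10 + 20) = 2.50; 0.010 */
        return 0;
    }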
Summary #1/3: The Principle of Locality
A program is likely to access a relatively small portion of the address space at any instant of time.
- Temporal locality: locality in time.
- Spatial locality: locality in space.
Three (+1) major categories of cache misses:
- Compulsory misses: sad facts of life; example: cold-start misses.
- Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
- Capacity misses: increase cache size.
- Coherence misses: caused by external processors or I/O devices.
Cache design space: total size, block size, associativity, replacement policy, write-hit policy (write-through, write-back), write-miss policy.

Summary #2/3: The Cache Design Space
Several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, write allocation.
The optimal choice is a compromise:
- It depends on access characteristics: the workload, and the use (I-cache, D-cache, TLB).
- It depends on technology / cost.
Simplicity often wins.
(Figure: the cache design space sketched as trade-off curves, "less" of one factor vs. "more" of another, with good and bad regions, across cache size, associativity, and block size.)

Summary #3/3: Cache Miss Optimization
Lots of techniques are used to improve the miss rate of caches ("+" helps, "-" hurts):
Technique                        MR   MP   HT   Complexity
Larger block size                +    -         0
Higher associativity             +         -    1
Victim caches                    +              2
Pseudo-associative caches        +              2
HW prefetching of instr/data     +              2
Compiler-controlled prefetching  +              3
Compiler reduce misses           +              0

Onto Assembler!

What is assembly language?
A machine-specific programming language:
- one-to-one correspondence between statements and native machine language
- matches the machine's instruction set and architecture

What is an assembler?
A systems-level program, usually working in conjunction with the compiler, that translates assembly language source code to machine language.
- Object file: contains machine instructions, initial data, and information used when loading the program.
- Listing file: contains a record of the translation process: line numbers, addresses, generated code and data, and a symbol table.

Why learn assembly?
- Learn how a processor works; understand basic computer architecture.
- Explore the internal representation of data and instructions; gain insight into hardware concepts.
- Allows creation of small and efficient programs.
- Allows programmers to bypass high-level language restrictions; might be necessary to accomplish certain operations.

Machine Representation
A language of numbers, called the processor's instruction set: the set of basic operations a processor can perform.
- Each instruction is coded as a number; instructions may be one or more bytes.
- Every number corresponds to an instruction.

Assembly vs. Machine
- Machine language programming: writing a list of numbers representing the bytes of machine instructions to be executed and the data constants to be used by the program.
- Assembly language programming: using symbolic instructions to represent the raw data that will form the machine language program and its initial data constants.

Assembly Mnemonics Represent Machine Instructions
- Each mnemonic used represents a single machine instruction; the assembler performs the translation.
- Some mnemonics require operands. Operands provide additional information: a register, constant, address, or variable.
- Assembler directives.

Instruction Set Architecture: a Critical Interface
software <-> instruction set <-> hardware
The ISA is the portion of the machine that is visible to the programmer or the compiler writer.
Good ISA
- Lasts through many implementations (portability, compatibility).
- Can be used for many different applications (generality).
- Provides convenient functionality to higher levels.
- Permits an efficient implementation at lower levels.

Von Neumann Machines
Von Neumann "invented" the stored-program computer in 1945: instead of the program being hardwired, the program code (instructions) is placed in memory along with the data.
(Figure: Control and ALU connected to a memory holding both Program and Data.)

Basic ISA Classes: Memory-to-Memory Machines
- Every instruction contains a full memory address for each operand.
- Maybe the simplest ISA design. However, memory is slow, and memory is big (lots of address bits).

Memory-to-Memory Machine
Assumptions: two operands per operation; the first operand is also the destination; memory address = 16 bits (2 bytes); operand size = 32 bits (4 bytes); instruction code = 8 bits (1 byte).
Example, A = B + C (hypothetical code):
    mov A, B    ; A <= B
    add A, C    ; A <= B + C
Each instruction is 5 bytes; each operand fetch is 4 bytes; each result store is 4 bytes. So mov needs 13 bytes (5 + 4 + 4) and add needs 17 bytes (5 + 4 + 4 + 4): 30 bytes of memory traffic in total.

Why CPU Storage?
A small amount of storage in the CPU reduces memory traffic by keeping repeatedly used operands in the CPU:
- avoid re-referencing memory
- avoid having to specify the full memory address of the operand
This is a perfect example of "make the common case fast".

Simplest Case: the Accumulator
A machine with 1 cell of CPU storage: the accumulator.
Accumulator machine assumptions: two operands per operation; the 1st operand is in the accumulator, the 2nd in memory; the accumulator is also the destination (except for store); memory address = 16 bits (2 bytes); operand size = 32 bits (4 bytes); instruction code = 8 bits (1 byte).
Example, A = B + C (hypothetical code):
    Load  B    ; acc <= B
    Add   C    ; acc <= B + C
    Store A    ; A <= acc
3 bytes per instruction plus 4 bytes to load or store the memory operand = 7 bytes per instruction, 21 bytes of total memory traffic.

Stack Machines
Instruction sets based on a stack model of execution, aimed at compact instruction encoding.
- Most instructions manipulate the top few data items (mostly the top 2) of a pushdown stack; the top few items of the stack are kept in the CPU.
- Ideal for evaluating expressions (the stack holds intermediate results); were thought to be a good match for high-level languages.
- Awkward: they become very slow if the stack grows beyond the CPU's local storage, and there is no simple way to get data from the "middle of the stack".

Stack Machine Operations
- Binary arithmetic and logic operations: operands are the top 2 items on the stack; the operands are removed and the result is placed on top of the stack.
- Unary arithmetic and logic operations: the operand is the top item on the stack; it is replaced by the result of the operation.
- Data move operations: push (place memory data on top of the stack), pop (move the top of the stack to memory).
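A tiny stack-machine sketch in C (my illustration, not from the slides): evaluating A = B + C with push/push/add/pop, the same sequence that appears in the instruction-count comparison below.

    #include <stdio.h>

    static int stack[16];
    static int sp;                          /* next free slot */

    static void push(int v)   { stack[sp++] = v; }
    static int  pop(void)     { return stack[--sp]; }
    static void add_top(void) { int r = pop() + pop(); push(r); }  /* binary op */

    int main(void) {
        int B = 2, C = 3, A;
        push(B);          /* push B */
        push(C);          /* push C */
        add_top();        /* add: pops both operands, pushes the result */
        A = pop();        /* pop A */
        printf("A = %d\n", A);
        return 0;
    }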
General Purpose Register Machines
With stack machines, only the top two elements of the stack are directly available to instructions. In general purpose register machines, the CPU storage is organized as a set of registers which are equally available to the instructions.
- Frequently used operands are placed in registers (under program control).
- Reduces instruction size; reduces memory traffic.

General Purpose Registers Dominate
From 1975 to the present, all machines use general purpose registers.
Advantages of registers:
- Registers are faster than memory.
- Registers are easier for a compiler to use; e.g., for (A*B) - (C*D) - (E*F) the multiplies can be done in any order.
- Registers can hold variables: memory traffic is reduced, so the program is sped up (since registers are faster than memory), and code density improves (since a register is named with fewer bits than a memory location).

Classifying General Purpose Register Machines
GPR machines are subclassified based on whether or not memory operands can be used by typical ALU instructions:
- Register-memory machines: some ALU instructions can specify at least one memory operand and one register operand.
- Load-store machines: the only instructions that can access memory are the "load" and the "store" instructions.

Comparing Number of Instructions
Code sequence for A = B + C for five classes of instruction sets:
Memory to Memory   Accumulator   Stack     Register (register-memory)   Register (load-store)
mov A, B           load B        push B    load R1, B                    load R1, B
add A, C           add C         push C    add R1, C                     load R2, C
                   store A       add       store A, R1                   add R1, R1, R2
                                 pop A                                   store A, R1
DLX/MIPS is a load-store machine.

Instruction Set Definition
Objects = architecture entities = machine state:
- Registers: general purpose, special purpose (e.g. program counter, condition code, stack pointer).
- Memory locations: a linear address space 0, 1, 2, ..., 2^s - 1.
Operations = instruction types:
- Data operations: arithmetic, logical.
- Data transfer: move (register to register), load (memory location to register), store (register to memory location).
- Instruction sequencing: branch (conditional), jump (unconditional).

Topic: the DLX Instruction Architecture
Much nicer and easier to understand than x86 (barf). The plan: teach DLX, then move to x86/y86.
DLX is a RISC ISA, very similar to MIPS.
Great link to learn more about DLX: http://www.softpanorama.org/Hardware/architecture.shtml#DLX

DLX Architecture
Based on observations about instruction set architecture. Emphasizes:
- a simple load-store instruction set
- design for pipeline efficiency
- design as a compiler target

DLX Registers
- 32 32-bit GPRs named R0, R1, ..., R31. Register R0 is hard-wired to zero.
- 32 32-bit FPRs: accessed independently for 32-bit data, and in even-numbered pairs (F0, F2, ..., F30) for 64-bit (double-precision) data.
- Other status registers, e.g. the floating-point status register.
- Byte addressable, big-endian, with 32-bit addresses.
- Arithmetic instruction operands must be registers.

MIPS: Software Conventions for Registers
 0      zero   constant 0
 1      at     reserved for the assembler
 2-3    v0-v1  expression evaluation & function results
 4-7    a0-a3  arguments
 8-15   t0-t7  temporary: caller saves (callee can clobber)
16-23   s0-s7  callee saves (callee must save)
24-25   t8-t9  temporary (cont'd)
26-27   k0-k1  reserved for the OS kernel
28      gp     pointer to global area
29      sp     stack pointer
30      fp     frame pointer
31      ra     return address (HW)
(HW) Addressing Modes
This table shows the most common modes:
Addressing Mode     Example Instruction   Meaning                           When Used
Register            Add R4, R3            R[R4] <- R[R4] + R[R3]            When a value is in a register.
Immediate           Add R4, #3            R[R4] <- R[R4] + 3                For constants.
Displacement        Add R4, 100(R1)       R[R4] <- R[R4] + M[100 + R[R1]]   Accessing local variables.
Register deferred   Add R4, (R1)          R[R4] <- R[R4] + M[R[R1]]         Using a pointer or a computed address.
Absolute            Add R4, (1001)        R[R4] <- R[R4] + M[1001]          Used for static data.

Memory Organization
Memory is viewed as a large, single-dimension array of bytes, with an address. A memory address is an index into the array; "byte addressing" means that the index points to a byte of memory (address 0 holds 8 bits of data, address 1 the next 8 bits, and so on).

Memory Addressing
Bytes are nice, but most data items use larger "words". For DLX, a word is 32 bits or 4 bytes.
Two questions for the design of an ISA: since one could read a 32-bit word as four loads of bytes from sequential byte addresses or as one load word from a single byte address,
- How do byte addresses map to word addresses?
- Can a word be placed on any byte boundary?

Addressing Objects: Endianness and Alignment
- Big endian: the address of the most significant byte = the word address (xx00 = the big end of the word). IBM 360/370, Motorola 68k, MIPS, Sparc, HP PA.
- Little endian: the address of the least significant byte = the word address (xx00 = the little end of the word). Intel 80x86, DEC Vax, DEC Alpha (Windows NT).
(Figure: byte 0 of a word holds the msb on a big-endian machine and the lsb on a little-endian machine; examples of aligned and misaligned words.)
- Alignment: require that objects fall on an address that is a multiple of their size.

Assembly Language vs. Machine Language
- Assembly provides a convenient symbolic representation: much easier than writing down numbers; e.g., destination first.
- Machine language is the underlying reality: e.g., the destination is no longer first.
- Assembly can provide 'pseudoinstructions': e.g., "move r10, r11" exists only in assembly; it would be implemented using "add r10, r11, r0".
- When considering performance you should count real instructions.

DLX Arithmetic
ALU instructions can have 3 operands:
    add $R1, $R2, $R3
    sub $R1, $R2, $R3
Operand order is fixed (destination first).
Example: the C code A = B + C becomes the DLX code add r1, r2, r3 (registers are associated with variables by the compiler).

DLX Arithmetic (cont.)
Design principle: simplicity favors regularity. Of course this complicates some things. Why?
C code:
    A = B + C + D;
    E = F - A;
MIPS code:
    add r1, r2, r3
    add r1, r1, r4
    sub r5, r6, r1
Operands must be registers, and only 32 registers are provided.
Design principle: smaller is faster. Why?

Executing Assembly Instructions
- The program counter holds the instruction address.
- The CPU fetches the instruction from memory and puts it into the instruction register.
- Control logic decodes the instruction and tells the register file, ALU and other registers what to do.
- For an ALU operation (e.g. add), data flows from the register file, through the ALU, and back to the register file.
(Figures: ALU execution examples.)

Memory Instructions
Load and store instructions:
    lw r11, offset(r10)
    sw r11, offset(r10)
Example: C code A[8] = h + A[8]; assume h is in r2 and the base address of the array A is in r3.
DLX code:
    lw  r4, 32(r3)
    add r4, r2, r4
    sw  r4, 32(r3)
Store word has the destination last. Remember: arithmetic operands are registers, not memory!
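A small C sketch (not from the slides) of the byte-offset arithmetic behind "lw r4, 32(r3)" above: with 4-byte words, array element A[8] sits 32 bytes past the base address of A.

    #include <stdio.h>
    #include <stddef.h>
    #include <stdint.h>

    int main(void) {
        int32_t A[16];
        uint8_t *base = (uint8_t *)A;        /* byte address of A[0], like r3 */

        /* element index 8 -> byte offset 8 * sizeof(int32_t) = 32 */
        ptrdiff_t byte_offset = (uint8_t *)&A[8] - base;
        printf("byte offset of A[8] = %td\n", byte_offset);   /* prints 32 */
        return 0;
    }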
Memory Operations: Loads
Load data from memory:
    lw R6, 0(R5)    # R6 <= mem[0x14]   (assuming R5 = 0x14)

Memory Operations: Stores
Storing data to memory works essentially the same way:
    sw R6, 0(R5)
If R6 = 200 and, let's assume, R5 = 0x18, then mem[0x18] <- 200.

So far we've learned
DLX: loading words but addressing bytes; arithmetic on registers only.
Instruction          Meaning
add r1, r2, r3       r1 = r2 + r3
sub r1, r2, r3       r1 = r2 - r3
lw  r1, 100(r2)      r1 = Memory[r2 + 100]
sw  r1, 100(r2)      Memory[r2 + 100] = r1

Use of Registers
Example: a = (b + c) - (d + e);   // C statement; a-e in r1-r5
    add r10, r2, r3
    add r11, r4, r5
    sub r1, r10, r11
Example: a = b + A[4];   // add an array element to a variable; r3 has the address of A
    lw  r4, 16(r3)
    add r1, r2, r4

Use of Registers: Load and Store
Example: A[8] = a + A[6];   // A is in r3, a is in r2
    lw  r1, 24(r3)    # r1 gets A[6] contents
    add r1, r2, r1    # r1 gets the sum
    sw  r1, 32(r3)    # sum is put in A[8]

Load and Store Example
a = b + A[i];   // A is in r3; a, b, i in r1, r2, r4
    add r11, r4, r4     # r11 = 2 * i
    add r11, r11, r11   # r11 = 4 * i
    add r11, r11, r3    # r11 = addr. of A[i]  (r3 + 4*i)
    lw  r10, 0(r11)     # r10 = A[i]
    add r1, r2, r10     # a = b + A[i]

Example: Swap
Swapping words: temp = v[0]; v[0] = v[1]; v[1] = temp;   (r2 has the base address of the array v)
    swap: lw r10, 0(r2)
          lw r11, 4(r2)
          sw r10, 4(r2)
          sw r11, 0(r2)

DLX Instruction Format
Three instruction formats: I-type, R-type, J-type.
- I-type:  Opcode (6) | Rs1 (5) | rd (5) | Immediate (16)
- R-type:  Opcode (6) | Rs1 (5) | Rs2 (5) | rd (5) | func (11)
- J-type:  Opcode (6) | Offset added to PC (26)

Machine Language
Instructions, like registers and words of data, are also 32 bits long.
Example: add r10, r1, r2; the registers have numbers 10, 1, 2.
R-type encoding:  000000 | 00001 | 00010 | 01010 | 0...000100000

Machine Language: Loads and Stores
Consider the load-word and store-word instructions. What would the regularity principle have us do? New principle: good design demands a compromise.
Introduce a new type of instruction format, I-type, for data transfer instructions (the other format was R-type, for register operations).
Example: lw r10, 32(r2)
I-type encoding (loads/stores), Opcode (6) | Rs1 (5) | rd (5) | Immediate (16):
    100011 | 00010 | 01010 | 0...000100000

Machine Language: Jumps
Jump instructions. Example: j .L1
J-type instructions (jump, jump and link, trap, return from exception), Opcode (6) | Offset added to PC (26):
    000010 | offset to .L1

Instructions for Making Decisions
- beq reg1, reg2, L1: go to the statement labeled L1 if the value in reg1 equals the value in reg2.
- bne reg1, reg2, L1: go to the statement labeled L1 if the value in reg1 does not equal the value in reg2.
- j  L1: unconditional jump to L1.
- jr r10: "jump register": jump to the instruction address held in register r10.

Making Decisions Example
if (a != b) goto L1;  x = y + z;  L1: x = x - a;   // x, y, z, a, b mapped to r1-r5
    bne r4, r5, L1      # goto L1 if a != b
    add r1, r2, r3      # x = y + z (skipped if a != b)
L1: sub r1, r1, r4      # x = x - a (always executed)

If-then-else
Example: if (a == b) x = y + z; else x = y - z;
      bne r4, r5, Else    # goto Else if a != b
      add r1, r2, r3      # x = y + z
      j   Exit            # goto Exit
Else: sub r1, r2, r3      # x = y - z
Exit:
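To tie the formats to the encodings shown above, here is a C sketch (mine, not from the slides) that packs the R-type fields of add r10, r1, r2 into a 32-bit word using the 6/5/5/5/11 layout; the func value 0x20 is taken from the binary encoding on the machine-language slide.

    #include <stdio.h>
    #include <stdint.h>

    static uint32_t encode_rtype(uint32_t opcode, uint32_t rs1, uint32_t rs2,
                                 uint32_t rd, uint32_t func)
    {
        return (opcode << 26) | (rs1 << 21) | (rs2 << 16) | (rd << 11) | func;
    }

    int main(void) {
        /* add r10, r1, r2: opcode = 0, rs1 = 1, rs2 = 2, rd = 10, func = 0x20 */
        uint32_t word = encode_rtype(0, 1, 2, 10, 0x20);
        printf("0x%08X\n", word);   /* 0x00225020 = 000000 00001 00010 01010 00000100000 */
        return 0;
    }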
Example: Loop with Array Index
Loop:  g = g + A[i];  i = i + j;  if (i != h) goto Loop;
// r1, r2, r3, r4 = g, h, i, j; array base in r5
LOOP: add r11, r3, r3     # r11 = 2 * i
      add r11, r11, r11   # r11 = 4 * i
      add r11, r11, r5    # r11 = addr. of A[i]
      lw  r10, 0(r11)     # load A[i]
      add r1, r1, r10     # g = g + A[i]
      add r3, r3, r4      # i = i + j
      bne r3, r2, LOOP    # goto LOOP if i != h

Other Decisions
Set R1 if R2 is less than R3:
    slt R1, R2, R3
Compares two registers, R2 and R3: R1 = 1 if R2 < R3, else R1 = 0 (R2 >= R3).
Example: slt r11, r1, r2

Branch Less Than
Example: if (A < B) goto LESS
    slt r11, r1, r2       # r11 = 1 if A < B
    bne r11, $0, LESS

Loops
Example: while (A[i] == k) i = i + j;   // i, j, k in r3, r4, r5; A is in r6
Loop: sll r11, r3, 2      # r11 = 4 * i
      add r11, r11, r6    # r11 = addr. of A[i]
      lw  r10, 0(r11)     # r10 = A[i]
      bne r10, r5, Exit   # goto Exit if A[i] != k
      add r3, r3, r4      # i = i + j
      j   Loop            # goto Loop
Exit:

Addresses in Branches and Jumps
Instructions:
    bne r14, r15, Label   # next instruction is at Label if r14 != r15
    beq r14, r15, Label   # next instruction is at Label if r14 = r15
    j   Label             # next instruction is at Label
Formats: I-type (op, rs, rt, 16-bit address) for branches; J-type (op, 26-bit address) for jumps.
Addresses are not 32 bits. How do we handle this with large programs?
First idea: limit branch targets to the first 2^16 addresses.

Addresses in Branches
Instructions: bne r14, r15, Label and beq r14, r15, Label; I-type format (op, rs, rt, 16-bit address).
Better: treat the 16-bit number as an offset to the PC register, i.e. PC-relative addressing.
It is a word offset instead of a byte offset. Why? Instructions are word-aligned, so a word offset reaches four times farther, and most branches are local anyway (principle of locality).
Jump instructions just reuse the high-order bits of the PC: pseudodirect addressing.
The 32-bit jump address = the 4 most significant bits of the PC concatenated with the 26-bit word address (or 28-bit byte address), giving address boundaries of 256 MB.

Conditional Branch Distance
(Figure: distribution of branch displacement in words, 0 to 15, for integer and floating-point averages.)
65% of integer branches are 2 to 4 instructions away.

Conditional Branch Addressing
PC-relative addressing works because most branches are relatively close to the current PC; at least 8 bits of displacement are suggested (128 instructions).
Compare equal/not-equal is the most important comparison for integer programs (86%).
(Figure: frequency of comparison types in branches. EQ/NE: 86% integer, 37% FP; GT/LE: 7% integer, 23% FP; LT/GE: 7% integer, 40% FP.)

PC-relative Addressing
For larger distances, the jump register instruction jr is required.
Example (assume the address of LOOP is 80000):
LOOP: mult $9, $19, $10    # R9 = R19 * R10
      lw   $8, 1000($9)    # R8 = @(R9 + 1000)
      bne  $8, $21, EXIT
      add  $19, $19, $20   # i = i + j
      j    LOOP
EXIT: ...
Encoding:
80000   0    19   10   9     0   24
80004   35   9    8    1000
80008   5    8    21   2
80012   0    19   20   19    0   32
80016   2    80000
80020   ...
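A C sketch (mine, not from the slides) of how the hardware forms the two kinds of targets described above, using the bne and j from the example; the offsets are taken from the encoding table, and the branch base is assumed to be PC + 4.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* PC-relative branch: 16-bit signed WORD offset added to the PC. */
        uint32_t pc = 80008;                    /* address of the bne above   */
        int16_t  word_offset = 2;               /* from the instruction field */
        uint32_t branch_target = (pc + 4) + ((int32_t)word_offset << 2);
        printf("branch target = %u\n", branch_target);   /* 80020 = EXIT */

        /* Pseudodirect jump: upper 4 bits of PC ++ 26-bit word address ++ 00. */
        uint32_t word_addr = 20000;             /* 80000 / 4 (assumed field value) */
        uint32_t jump_target = ((pc + 4) & 0xF0000000u) | (word_addr << 2);
        printf("jump target = %u\n", jump_target);       /* 80000 = LOOP */
        return 0;
    }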
Procedure Calls
Procedures or subroutines are needed for structured programming. Steps followed in executing a procedure call:
1. Place parameters in a place where the procedure (callee) can access them.
2. Transfer control to the procedure.
3. Acquire the storage resources needed for the procedure.
4. Perform the desired task.
5. Place the results in a place where the calling program (caller) can access them.
6. Return control to the point of origin.

Resources Involved
Registers used for procedure calling:
- $a0 - $a3: four argument registers in which to pass parameters.
- $v0 - $v1: two value registers in which to return values.
- r31: one return address register to return to the point of origin.
Transferring control to the callee:
- jal ProcedureAddress: jump-and-link to the procedure address; the return address (PC + 4) is saved in r31. Example: jal 20000.
Returning control to the caller:
- jr r31: the instruction following the jal is executed next.

Memory Stacks
Useful for stacked environments / subroutine call & return, even if an operand stack is not part of the architecture.
Stacks that grow up vs. stacks that grow down:
(Figure: a stack region between the low address ("0, little") and the high address ("inf., big") holding a, b, c, with SP marking the top; the stack can grow upward or downward through memory addresses.)

Calling Conventions
int func(int g, int h, int i, int j)
{ int f;
  f = (g + h) - (i + j);
  return f;
}
// g, h, i, j in $a0, $a1, $a2, $a3; f in r8
func: addi $sp, $sp, -12    # make room in the stack for 3 words
      sw   r11, 8($sp)      # save the regs we want to use
      sw   r10, 4($sp)
      sw   r8,  0($sp)
      add  r10, $a0, $a1    # r10 = g + h
      add  r11, $a2, $a3    # r11 = i + j
      sub  r8, r10, r11     # r8 has the result
      add  $v0, r8, r0      # return reg $v0 has f

Calling Conventions (cont.)
      lw   r8,  0($sp)      # restore r8
      lw   r10, 4($sp)      # restore r10
      lw   r11, 8($sp)      # restore r11
      addi $sp, $sp, 12     # restore sp
      jr   $ra
- We did not have to restore r10-r19 (caller save).
- We do need to restore r1-r8 (must be preserved by the callee).

Nested Calls
Stacking of subroutine calls & returns and their environments:
A: CALL B -> B: CALL C -> C: ... RET -> back in B ... RET -> back in A
(The environment stack grows from A, to A,B, to A,B,C, and shrinks again on each return.)
Some machines provide a memory stack as part of the architecture (e.g., VAX, JVM); sometimes stacks are implemented via software convention.
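A small C sketch (my illustration, not from the slides) of the nested-call picture above: the return addresses behave as a software-managed stack, which is what the jal/jr convention maintains in memory.

    #include <stdio.h>

    static int ra_stack[8];              /* return-address stack */
    static int depth;

    static void call(int return_addr) { ra_stack[depth++] = return_addr; }
    static int  ret(void)             { return ra_stack[--depth]; }

    int main(void) {
        call(100);                       /* A calls B; A's return point is 100 */
        call(200);                       /* B calls C; B's return point is 200 */
        printf("return to %d\n", ret()); /* C returns into B (200) */
        printf("return to %d\n", ret()); /* B returns into A (100) */
        return 0;
    }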
Compiling a String Copy Procedure
void strcpy (char x[], char y[])
{
  int i = 0;
  while ((x[i] = y[i]) != 0)
    i++;
}
// x and y base addresses are in $a0 and $a1
strcpy: addi $sp, $sp, -4     # reserve 1 word of space in the stack
        sw   r8, 0($sp)       # save r8
        add  r8, $zero, $zero # i = 0
L1:     add  r11, $a1, r8     # addr. of y[i] in r11
        lb   r12, 0(r11)      # r12 = y[i]
        add  r13, $a0, r8     # addr. of x[i] in r13
        sb   r12, 0(r13)      # x[i] = y[i]
        beq  r12, $zero, L2   # if y[i] == 0 goto L2
        addi r8, r8, 1        # i++
        j    L1               # go to L1
L2:     lw   r8, 0($sp)       # restore r8
        addi $sp, $sp, 4      # restore $sp
        jr   $ra              # return

IA-32
1978: The Intel 8086 is announced (16-bit architecture).
1980: The 8087 floating-point coprocessor is added.
1982: The 80286 increases the address space to 24 bits, and adds instructions.
1985: The 80386 extends to 32 bits, with new addressing modes.
1989-1995: The 80486, Pentium, and Pentium Pro add a few instructions (mostly designed for higher performance).
1997: 57 new "MMX" instructions are added; Pentium II.
1999: The Pentium III adds another 70 instructions (SSE).
2001: Another 144 instructions (SSE2).
2003: AMD extends the architecture to increase the address space to 64 bits, widens all registers to 64 bits, and makes other changes (AMD64).
2004: Intel capitulates and embraces AMD64 (calling it EM64T) and adds more media extensions.
"This history illustrates the impact of the 'golden handcuffs' of compatibility"; "adding new features as someone might add clothing to a packed bag"; "an architecture that is difficult to explain and impossible to love".

IA-32 Overview
Complexity:
- Instructions are from 1 to 17 bytes long.
- One operand must act as both a source and a destination.
- One operand can come from memory.
- Complex addressing modes, e.g. "base or scaled index with 8 or 32 bit displacement".
Saving grace:
- The most frequently used instructions are not too difficult to build.
- Compilers avoid the portions of the architecture that are slow.
"What the 80x86 lacks in style is made up in quantity, making it beautiful from the right perspective."

IA-32 Registers, Oversimplified
- Four 32-bit general purpose registers: eax, ebx, ecx, edx. (al names "the lower 8 bits of eax".)
- Stack pointer: esp.
Fun fact: once upon a time x86 was only a 16-bit CPU. When x86 was upgraded to 32 bits, an "e" (for "extended") was added in front of every register name.

Intel 80x86 Integer Registers
GPR0    EAX    Accumulator
GPR1    ECX    Count register; string and loop operations
GPR2    EDX    Data register; multiply, divide
GPR3    EBX    Base address register
GPR4    ESP    Stack pointer
GPR5    EBP    Base pointer (base of the stack segment)
GPR6    ESI    Index register
GPR7    EDI    Index register
CS      Code segment pointer
SS      Stack segment pointer
DS      Data segment pointer
ES      Extra data segment pointer
FS      Data segment 2
GS      Data segment 3
EIP     Instruction counter (PC)
EFLAGS  Condition codes

x86 Assembly: Data and Arithmetic
- mov <dest>, <src>: move the value from <src> into <dest>; used to set initial values.
- add <dest>, <src>: add the value from <src> to <dest>.
- sub <dest>, <src>: subtract the value from <src> from <dest>.

x86 Assembly: Stack
- push <target>: push the value in <target> onto the stack. Also decrements the stack pointer, ESP (remember: the stack grows from high addresses to low).
- pop <target>: pop the value from the top of the stack and put it in <target>. Also increments the stack pointer, ESP.

x86 Assembly: Control Flow
- jmp <address>: jump to an instruction (like goto); changes EIP to <address>.
- call <address>: a function call; pushes the address of the next instruction (the return address) onto the stack, and jumps to <address>.
x86 Assembly: Addresses and Interrupts
- lea <dest>, <src>: load the effective address of <src> into register <dest>. Used for pointer arithmetic (no actual memory reference is made).
- int <value>: interrupt, a signal to the operating system kernel with flag <value>. int 0x80 means "Linux system call".

x86 Assembly: Condition Codes
- CF: carry flag, overflow detection (unsigned).
- ZF: zero flag.
- SF: sign flag.
- OF: overflow flag, overflow detection (signed).
Condition codes are usually accessed through conditional branches, not directly.

Interrupt Convention
int 0x80, the system call interrupt:
- eax: system call number (e.g. 1 = exit, 2 = fork, 3 = read, 4 = write)
- ebx: argument #1
- ecx: argument #2
- edx: argument #3

CISC vs. RISC
RISC = Reduced Instruction Set Computer (DLX). CISC = Complex Instruction Set Computer (x86). Both have their advantages.
RISC:
- Not very many instructions; all instructions the same length, in both execution time and bit length.
- Results in simpler CPUs (easier to optimize).
- Usually takes more instructions to express a useful operation.
CISC:
- Very many instructions (the x86 instruction set documentation runs over 700 pages); variable-length instructions.
- More complex CPUs.
- Usually takes fewer instructions to express a useful operation (lower memory requirements).

Some Links
Great stuff on x86 asm: http://www.cs.dartmouth.edu/~cs37/

Instruction Architecture Ideas
- If code size is most important, use variable-length instructions.
- If simplicity is most important, use fixed-length instructions. Simpler ISAs are easier to optimize!
- Recent embedded machines (ARM, MIPS) added an optional mode to execute a subset of 16-bit-wide instructions (Thumb, MIPS16); each procedure decides between performance and density.

Summary
Instruction complexity is only one variable: lower instruction count vs. higher CPI / lower clock rate.
Design principles:
- simplicity favors regularity
- smaller is faster
- good design demands compromise
- make the common case fast
Instruction set architecture is a very important abstraction indeed!