Princess Sumaya University for Technology
Computer Architecture - Dr. Esam Al_Qaralleh
Memory Hierarchy Design

5.1 Introduction

Memory Hierarchy Design
- Motivated by the principle of locality - a 90/10 type of rule
  • Spatial locality - nearby references are likely
  • Temporal locality - the same reference is likely to recur soon
- Also motivated by cost/performance structures
  • Smaller hardware is faster: SRAM, DRAM, disk, tape
  • Access time vs. bandwidth variations
  • Fast memory is more expensive
- Goal: provide a memory system with cost almost as low as the cheapest level and speed almost as fast as the fastest level

Why is memory relevant in computer design?
- A computer's performance is given by the number of instructions executed per time unit
- The time for executing an instruction depends on:
  • The ALU speed (i.e., the data-path cycle duration)
  • The time it takes each instruction to load/store its operands/result from/into memory (in brief, the time to access memory)
- The processing speed (CPU speed) grows faster than the memory speed, so the CPU speed cannot be fully exploited. This speed gap leads to an unbalanced system.

DRAM/CPU Gap
- CPU performance improves at about 55% per year (in 1996 it was a phenomenal 18% per month)
- DRAM performance has improved at only about 7% per year

Levels in a Typical Memory Hierarchy / Sample Memory Hierarchy (figures)

Unit of Transfer / Addressable Unit
- Unit of transfer: number of bits read from, or written into, memory at a time
  • Internal: usually governed by the data bus width
  • External: usually a block of words, e.g., 512 or more
- Addressable unit: smallest location that can be uniquely addressed
  • Internal: word or byte
  • External: device dependent, e.g., a disk "cluster"

Access Methods
- Sequential: data is stored in records, access is in linear sequence (tape)
- Direct: data blocks have a unique and direct address, data within a block is accessed sequentially (disk)
- Random: data has unique and direct access (RAM)
- Associative: data is retrieved based on a (partial) match of its contents rather than its address (cache)

Sequential Access Method
- Start at the beginning and read through in order
- Access time depends on the location of the data and on the previous location (e.g., tape)

Direct Access Method
- Individual blocks have a unique address
- Access is by jumping to the vicinity plus a sequential search (or waiting, e.g., for the disk to rotate)
- Access time depends on the target location and the previous location (e.g., disk)

Random Access Method
- Individual addresses identify specific locations
- Access time is independent of location or previous access (e.g., RAM)

Associative Access Method
- Data is located by comparison with the contents of a portion of the store (Content Addressable Memory)
- Access time is independent of location or previous access (e.g., cache)

Registers
- Used for storing data (e.g., 32 bits wide, 16 words)
- Register files: multiple read/write ports are possible
- Example: a 32-bit wide by 16-word deep register file with dual-port parallel read and single-port parallel write
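A behavioral sketch in C may make the port structure of the register file just described concrete (32 bits wide, 16 words, two read ports, one write port); the names and types here are illustrative, not from the slides.

#include <stdint.h>

/* Sketch of the example register file above: 32 bits wide, 16 words deep,
 * dual-port parallel read, single-port parallel write. Illustrative only. */
#define NUM_REGS 16

typedef struct {
    uint32_t regs[NUM_REGS];
} regfile_t;

/* Dual-port read: both operands are available in the same cycle. */
static void regfile_read(const regfile_t *rf, unsigned ra, unsigned rb,
                         uint32_t *out_a, uint32_t *out_b)
{
    *out_a = rf->regs[ra & (NUM_REGS - 1)];
    *out_b = rf->regs[rb & (NUM_REGS - 1)];
}

/* Single write port: one register can be updated per cycle. */
static void regfile_write(regfile_t *rf, unsigned rw, uint32_t value)
{
    rf->regs[rw & (NUM_REGS - 1)] = value;
}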
Implementing Registers Using Logic Gates
- Flip-flops; the simplest storage element is the SR latch
- SR latch behavior (active-low inputs):
  • S=1, R=1: Q and Q' hold their previous values
  • S=1, R=0: Q=0, Q'=1 (reset)
  • S=0, R=1: Q=1, Q'=0 (set)
  • S=0, R=0: not allowed (Q and Q' undefined)
- Flip-flop variants: JK, D, T; clocked; master-slave (edge-triggered)

Implementing Registers in CMOS
- A direct gate implementation is too costly: a master-slave JK flip-flop uses 38 CMOS transistors
- Instead, implement the storage cell directly in transistors (example: a clocked SR flip-flop)

Internal Organization of Memory Chips
- Memory cells are arranged in rows (word lines W0..W15 driven by an address decoder) and columns (bit line pairs b, b' connected to sense/write circuits), with R/W and CS control inputs and data input/output lines
- Figure 5.2. Organization of bit cells in a memory chip

A Memory Chip: organization of a 1K x 1 memory chip
- The 10-bit address is split into a 5-bit row address and a 5-bit column address
- The 5-bit row address drives a 5-bit decoder that selects one of 32 word lines of a 32 x 32 memory cell array
- The 5-bit column address drives a 32-to-1 output multiplexer and input demultiplexer through the sense/write circuitry
- Control inputs: R/W and CS; a single data input/output line

Static Memories
- The circuits are capable of retaining their state as long as power is applied
- A static RAM cell: two cross-coupled inverters (nodes X and Y) connected to the bit lines b and b' through access transistors T1 and T2 controlled by the word line
- Figure 5.5. An example of a CMOS memory cell; the CMOS cell has low power consumption

Asynchronous DRAMs
- Static RAMs are fast, but they cost more area and are more expensive
- Dynamic RAMs (DRAMs) are cheap and area-efficient, but they cannot retain their state indefinitely - they need to be refreshed periodically
- A single-transistor dynamic memory cell: one transistor T connects a storage capacitor C to the bit line, gated by the word line

A Dynamic Memory Chip: internal organization of a 2M x 8 dynamic memory chip
- The multiplexed address is captured by a row address latch (Row Address Strobe, RAS) and a column address latch (Column Address Strobe, CAS)
- Address lines A20-9 select a row; the row decoder drives a 4096 x (512 x 8) cell array through the sense/write circuits
- Address lines A8-0 and the column decoder select one of 512 byte-wide columns; data appears on D7-D0
- Control inputs: CS and R/W

5.2 Review of the ABCs of Caches

Basic Terms on Caches
- cache, fully associative, write allocate, virtual memory, dirty bit, unified cache, memory stall cycles, block offset, misses per instruction, direct mapped, write back, block, valid bit, data cache, locality, block address, hit time, address trace, write through, cache miss, set, instruction cache, page fault, random placement, average memory access time, miss rate, index field, cache hit, n-way set associative, no-write allocate, page, least-recently used, write buffer, miss penalty, tag field, write stall

Cache
- The first level of the memory hierarchy encountered once the address leaves the CPU
- There is a persistent mismatch between CPU and main-memory speeds
- Exploit the principle of locality by providing a small, fast memory between the CPU and main memory -- the cache memory
- "Cache" is now applied whenever buffering is employed to reuse commonly occurring items (e.g., file caches)
- Caching means copying information into a faster storage system; main memory can be viewed as a cache for secondary storage

General Hierarchy Concepts
- At each level the block concept is present (the block is the caching unit)
  • Block size may vary depending on the level
  • A longer access is amortized by bringing in a larger chunk - this works if the locality principle holds
- Hit: an access where the block is present; the hit rate is its probability
- Miss: an access where the block is absent (it is in a lower level); miss rate
- Mirroring and consistency
  • Data residing in a higher level is a subset of the data in the lower level
  • Changes at the higher level must be reflected down - sometime; the policy for "sometime" is the consistency mechanism
- Addressing: whatever the organization, you have to know how to get at it!
  • Address checking and protection
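A small sketch of the address split for the 1K x 1 chip above: the 10-bit address divides into a 5-bit row address (one of 32 word lines) and a 5-bit column address (32-to-1 multiplexer). Putting the row in the upper five bits is an assumption for illustration.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint16_t address = 0x2B7 & 0x3FF;         /* any 10-bit address            */
    unsigned row     = (address >> 5) & 0x1F; /* selects 1 of 32 word lines    */
    unsigned column  = address & 0x1F;        /* selects 1 of 32 bit positions */

    printf("address=0x%03X -> row=%u, column=%u\n", address, row, column);
    return 0;
}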
Physical Address Structure
- The key point is that you want different block sizes at different levels

Latency and Bandwidth
- The time required for a cache miss depends on both the latency and the bandwidth of the memory (or lower level)
  • Latency determines the time to retrieve the first word of the block
  • Bandwidth determines the time to retrieve the rest of the block
- A cache miss is handled by hardware and causes processors following in-order execution to pause, or stall, until the data are available

Predicting Memory Access Times
- On a hit: simply the access time of the cache
- On a miss: access time + miss penalty
  • Miss penalty = access time of the lower level + block transfer time
  • Block transfer time depends on the block size (bigger blocks mean longer transfers) and on the bandwidth between the two levels of memory (usually dominated by the slower memory and the bus protocol)
- Performance:
  • Average-Memory-Access-Time = Hit-Access-Time + Miss-Rate * Miss-Penalty
  • Memory-stall-cycles = IC * Memory-references-per-instruction * Miss-Rate * Miss-Penalty

Block Sizes, Miss Rates & Penalties, Accesses (figure)

Headaches of Memory Hierarchies
- The CPU never knows for sure whether an access will hit
- How deep will a miss be, i.e., what is the miss penalty?
  • If short, the CPU just waits
  • If long, it is probably best to work on something else - a task switch
  • This implies the penalty can be predicted with reasonable accuracy, and the task switch had better be fast or productivity/efficiency will suffer
- This implies some new needs: more hardware accounting, and software-readable accounting information (address trace)

Four Standard Questions
- Block Placement: where can a block be placed in the upper level?
- Block Identification: how is a block found if it is in the upper level?
- Block Replacement: which block should be replaced on a miss?
- Write Strategy: what happens on a write?
- Answer the four questions for the first level of the memory hierarchy

Block Placement Options
- Direct mapped: (Block address) MOD (# of cache blocks)
- Fully associative: the block can be placed anywhere
- Set associative: a set is a group of n blocks -- each block is called a way
  • A block is first mapped onto a set: (Block address) MOD (# of cache sets)
  • Within the set it can be placed anywhere
- Most caches are direct mapped, 2-way, or 4-way set associative

Block Placement Options (Cont.) (figure)

Block Identification
- Each cache block carries tags
  • Address tag: "which block am I?" - many memory blocks may map to the same cache block
  • Status tags: what state is the block in? (valid, dirty, etc.)
- The physical address is now: address tag ## set index ## block offset
  • Physical address = r + m + n bits: r bits of address tag, m bits of set index (2^m addressable sets in the cache), n bits of block offset (2^n bytes per block)
  • Note the relationship between block size, cache size, and tag size - the smaller the tag, the cheaper it is to compare
- Caches have an address tag on each block frame that gives the block address, plus a valid bit to say whether or not the entry contains a valid address
- The block frame address can be divided into the tag field and the index field
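The r/m/n split above is easy to express directly. The sketch below extracts tag, set index, and block offset from a physical address; the example parameters (64-byte blocks, 256 sets) are assumptions chosen only for illustration.

#include <stdint.h>
#include <stdio.h>

#define BLOCK_OFFSET_BITS 6   /* n: 2^6 = 64 bytes per block (assumed) */
#define SET_INDEX_BITS    8   /* m: 2^8 = 256 sets (assumed)           */

int main(void)
{
    uint32_t addr   = 0x12345678u;
    uint32_t offset = addr & ((1u << BLOCK_OFFSET_BITS) - 1);
    uint32_t index  = (addr >> BLOCK_OFFSET_BITS) & ((1u << SET_INDEX_BITS) - 1);
    uint32_t tag    = addr >> (BLOCK_OFFSET_BITS + SET_INDEX_BITS);

    printf("addr=0x%08X -> tag=0x%X, index=%u, offset=%u\n",
           addr, tag, index, offset);
    return 0;
}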
Block Replacement
- Random: just pick a block and discard it
  • A simple (pseudo-random) hash of the target block frame address; some designs use truly random selection, but the lack of reproducibility is a problem at debug time
- LRU - least recently used
  • Need to keep track of the time since each block was last accessed
  • Expensive if the number of blocks is large, due to the global compare
  • Hence an approximation is often used (e.g., a use bit per tag, or LFU)
- FIFO
- For direct-mapped placement there is only one choice

Data Cache Misses Per 1000 Instructions (figure: 64-byte blocks on an Alpha, using 10 SPEC2000 benchmarks)

Short Summaries from the Previous Figure
- Higher associativity is better for small caches; 2- or 4-way associative performs similarly to 8-way associative for larger caches
- A larger cache size is better
- LRU is best for small caches; random works fine for large caches
- FIFO outperforms random in smaller caches; there is little difference between LRU and random for larger caches

Improving Cache Performance
- The MIPS mix is 10% stores and 37% loads
  • Writes are about 10%/(100%+10%+37%) = 7% of overall memory traffic, and 10%/(10%+37%) = 21% of data cache traffic
- Make the common case fast - this implies optimizing caches for reads
- Read optimizations
  • The block can be read concurrently with the tag comparison
  • On a hit the read data is passed on; on a miss, discard the block and start the miss access
- Write optimizations
  • The block can't be modified until after the tag check - hence writes take longer

Write Options
- Write through: the write is posted to the cache line and through to the next lower level
  • Incurs a write stall (use an intermediate write buffer to reduce the stall)
- Write back: write only to the cache, not to the lower level
  • Implies that the cache and main memory are now inconsistent
  • Mark the line with a dirty bit; if the block is replaced and dirty, write it back
- Pros and cons - both are useful
  • Write through: no write on a read miss, simpler to implement, no inconsistency with main memory
  • Write back: uses less main memory bandwidth, write times are independent of main memory speed, and multiple writes within a block require only one write to main memory

5.3 Cache Performance

Cache Performance Example
- Each instruction takes 2 clock cycles (ignoring memory stalls)
- Cache miss penalty: 50 clock cycles; miss rate: 2%; average 1.33 memory references per instruction
  • Ideal: IC * 2 * cycle-time
  • With cache: IC * (2 + 1.33 * 2% * 50) * cycle-time = IC * 3.33 * cycle-time
  • No cache: IC * (2 + 1.33 * 100% * 50) * cycle-time
- The importance of the cache is greater for CPUs with lower CPI and higher clock rates - Amdahl's Law

Average Memory Access Time vs. CPU Time
- Compare two different cache organizations
  • Miss rate: direct-mapped 1.4%, 2-way set associative 1.0%
  • Clock cycle time: direct-mapped 2.0 ns, 2-way 2.2 ns
  • CPI with a perfect cache: 2.0; average memory references per instruction: 1.3; miss penalty: 70 ns; hit time: 1 CC
- Average Memory Access Time = Hit time + Miss rate * Miss penalty
  • AMAT(direct) = 1 * 2.0 + (1.4% * 70) = 2.98 ns
  • AMAT(2-way) = 1 * 2.2 + (1.0% * 70) = 2.90 ns
- CPU time
  • CPU(direct) = IC * (2 * 2.0 + 1.3 * 1.4% * 70) = 5.27 * IC
  • CPU(2-way) = IC * (2 * 2.2 + 1.3 * 1.0% * 70) = 5.31 * IC
- Since CPU time is our bottom-line evaluation, and since direct mapped is simpler to build, the preferred cache is direct mapped in this example
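The comparison above can be reproduced with a few lines of C; the numbers are taken from the example, and the code is only an illustrative calculator, not part of the original material.

#include <stdio.h>

int main(void)
{
    const double miss_penalty   = 70.0;   /* ns                         */
    const double refs_per_instr = 1.3;
    const double cpi_perfect    = 2.0;

    const double cycle[2]     = {2.0, 2.2};     /* ns: direct, 2-way    */
    const double miss_rate[2] = {0.014, 0.010}; /* direct, 2-way        */
    const char  *name[2]      = {"direct-mapped", "2-way"};

    for (int i = 0; i < 2; i++) {
        /* AMAT = hit time * cycle time + miss rate * miss penalty       */
        double amat = 1.0 * cycle[i] + miss_rate[i] * miss_penalty;
        /* CPU time per instruction = CPI * cycle + refs * miss rate * penalty */
        double cpu  = cpi_perfect * cycle[i]
                    + refs_per_instr * miss_rate[i] * miss_penalty;
        printf("%-14s AMAT = %.2f ns, CPU time = %.2f ns * IC\n",
               name[i], amat, cpu);
    }
    return 0;
}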
Unified and Split Cache
- Unified: one 32KB cache; Split: 16KB instruction cache and 16KB data cache
- Hit time: 1 clock cycle; miss penalty: 100 clock cycles
- A load/store hit takes 1 extra clock cycle in the unified cache (one port shared by instructions and data)
- 36% of instructions are loads/stores, so references to the cache are 74% instructions and 26% data
  • Miss rate (16KB instruction) = 3.82/1000/1.0 = 0.004
  • Miss rate (16KB data) = 40.9/1000/0.36 = 0.114
  • Miss rate for the split cache = (74% * 0.004) + (26% * 0.114) = 0.0324
  • Miss rate for the unified cache = 43.3/1000/(1+0.36) = 0.0318
- Average-memory-access-time = %inst * (hit-time + inst-miss-rate * miss-penalty) + %data * (hit-time + data-miss-rate * miss-penalty)
  • AMAT(split) = 74% * (1 + 0.004 * 100) + 26% * (1 + 0.114 * 100) = 4.24
  • AMAT(unified) = 74% * (1 + 0.0318 * 100) + 26% * (1 + 1 + 0.0318 * 100) = 4.44

Improving Cache Performance
- Average-memory-access-time = Hit-time + Miss-rate * Miss-penalty
- Strategies for improving cache performance
  • Reducing the miss penalty
  • Reducing the miss rate
  • Reducing the miss penalty or miss rate via parallelism
  • Reducing the time to hit in the cache

5.4 Reducing Cache Miss Penalty

Techniques for Reducing Miss Penalty
- Multilevel caches (the most important)
- Critical word first and early restart
- Giving priority to read misses over writes
- Merging write buffer
- Victim caches

Multi-Level Caches
- Probably the best miss-penalty reduction
- Performance measurement for 2-level caches:
  • AMAT = Hit-time-L1 + Miss-rate-L1 * Miss-penalty-L1
  • Miss-penalty-L1 = Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2
  • AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2)

Multi-Level Caches (Cont.)
- Definitions:
  • Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss-rate-L2)
  • Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss-rate-L1 * Miss-rate-L2)
  • The global miss rate is what matters
- Advantages:
  • Capacity misses in L1 end up with a significant penalty reduction since they will likely be supplied from L2 - no need to go to main memory
  • Conflict misses in L1 will similarly be supplied by L2

Miss Rate Example
- Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache
  • Miss rate for the first-level cache = 40/1000 (4%)
  • Local miss rate for the second-level cache = 20/40 (50%)
  • Global miss rate for the second-level cache = 20/1000 (2%)

Miss Rate Example (Cont.)
- Assume miss-penalty-L2 is 100 CC, hit-time-L2 is 10 CC, hit-time-L1 is 1 CC, and there are 1.5 memory references per instruction. What are the average memory access time and the average stall cycles per instruction? Ignore the impact of writes.
  • AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2) = 1 + 4% * (10 + 50% * 100) = 3.4 CC
  • Average memory stalls per instruction = Misses-per-instruction-L1 * Hit-time-L2 + Misses-per-instruction-L2 * Miss-penalty-L2 = (40*1.5/1000) * 10 + (20*1.5/1000) * 100 = 3.6 CC
  • Or (3.4 - 1.0) * 1.5 = 3.6 CC
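A small check of the two-level formulas above in C; the inputs are the example's numbers and the code is only an illustrative sketch.

#include <stdio.h>

int main(void)
{
    const double hit_l1 = 1.0, hit_l2 = 10.0, penalty_l2 = 100.0; /* clock cycles */
    const double miss_l1 = 0.04;        /* 40 misses per 1000 references          */
    const double miss_l2_local = 0.50;  /* 20 misses per 40 L2 accesses           */
    const double refs_per_instr = 1.5;

    /* AMAT = Hit-time-L1 + Miss-rate-L1 * (Hit-time-L2 + Miss-rate-L2 * Miss-penalty-L2) */
    double amat = hit_l1 + miss_l1 * (hit_l2 + miss_l2_local * penalty_l2);
    /* stalls/instr = misses/instr(L1) * Hit-time-L2 + misses/instr(L2) * Miss-penalty-L2 */
    double stalls = (0.040 * refs_per_instr) * hit_l2
                  + (0.020 * refs_per_instr) * penalty_l2;

    printf("AMAT = %.1f CC, stalls per instruction = %.1f CC\n", amat, stalls);
    return 0;
}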
Critical Word First and Early Restart
- Do not wait for the full block to be loaded before restarting the CPU
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in
  • Also called wrapped fetch or requested word first
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
- The benefits of critical word first and early restart depend on
  • Block size: generally useful only for large blocks
  • The likelihood of another access to the portion of the block that has not yet been fetched
  • Spatial locality problem: programs tend to want the next sequential word, so it is not clear how much the technique benefits

Giving Priority to Read Misses Over Writes
- With write through, write buffers complicate memory access: they might hold the updated value of a location needed on a read miss
  • RAW conflicts with main memory reads on cache misses
  • Making the read miss wait until the write buffer is empty increases the read miss penalty (by 50% on the old MIPS 1000 with a 4-word buffer)
  • Instead, check the write buffer contents before the read; if there is no conflict, let the memory access continue
- Write back: a read miss replacing a dirty block
  • Normal approach: write the dirty block to memory, then do the read
  • Instead, copy the dirty block to a write buffer, do the read first, and then do the write
  • The CPU stalls less since it restarts as soon as the read is done

Merging Write Buffer
- An entry of the write buffer often contains multiple words, but a write usually involves a single word
  • A single-word write occupies a whole entry if there is no write merging
- Write merging: check whether the address of new data matches the address of a valid write buffer entry; if so, combine the new data with that entry
- Advantages
  • Multi-word writes are usually faster than single-word writes
  • Reduces the stalls due to the write buffer being full

Write-Merging Illustration (figure)

Victim Caches
- Remember what was just discarded in case it is needed again
- Add a small fully associative cache (called a victim cache) between the cache and its refill path
  • It contains only blocks discarded from the cache because of a miss
  • It is checked on a miss to see if it has the desired data before going to the next lower level of memory
  • If yes, the victim block and the cache block are swapped
  • The victim and regular caches are addressed at the same time, so the penalty does not increase
- Jouppi (DEC WRL) showed a miss reduction of 20-95% for a 4KB direct-mapped cache with 1-5 victim blocks

Victim Cache Organization (figure)
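A minimal sketch of the victim-cache lookup just described: a small, fully associative buffer of recently discarded blocks, searched on a miss, with a swap on a victim hit. Sizes, names, and the FIFO replacement choice are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define VICTIM_ENTRIES 4
#define BLOCK_BYTES    64

typedef struct {
    bool     valid;
    uint32_t block_addr;            /* block-aligned address of the block   */
    uint8_t  data[BLOCK_BYTES];
} cache_block_t;

static cache_block_t victim[VICTIM_ENTRIES];
static unsigned next_victim;        /* FIFO pointer for victim replacement  */

/* Called on a main-cache miss; 'evicted' is the block being displaced.
 * Returns true if the victim cache held the missing block (blocks swapped);
 * false means the block must come from the next lower level. */
bool victim_lookup(uint32_t miss_block_addr, cache_block_t *evicted)
{
    for (unsigned i = 0; i < VICTIM_ENTRIES; i++) {
        if (victim[i].valid && victim[i].block_addr == miss_block_addr) {
            cache_block_t tmp = victim[i];   /* victim hit: swap the blocks  */
            victim[i] = *evicted;
            *evicted  = tmp;
            return true;
        }
    }
    victim[next_victim] = *evicted;          /* victim miss: keep the        */
    next_victim = (next_victim + 1) % VICTIM_ENTRIES; /* discarded block     */
    return false;
}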
5.5 Reducing Miss Rate

Classifying Cache Misses - the 3 C's
- Compulsory: independent of cache size
  • The first access to a block - there is no choice but to load it
  • Also called cold-start or first-reference misses
- Capacity: decreases as cache size increases
  • The cache cannot contain all the blocks needed during execution, so discarded blocks are later retrieved
- Conflict (collision): decreases as associativity increases
  • A side effect of set-associative or direct mapping
  • A block may be discarded and later retrieved if too many blocks map to the same cache set

Techniques for Reducing Miss Rate
- Larger block size
- Larger caches
- Higher associativity
- Way-prediction caches
- Compiler optimizations

Larger Block Sizes
- Obvious advantage: reduces compulsory misses, thanks to spatial locality
- Obvious disadvantages
  • Higher miss penalty: a larger block takes longer to move
  • May increase conflict misses and capacity misses if the cache is small
- Don't let the increase in miss penalty outweigh the decrease in miss rate

Large Caches
- Help with both conflict and capacity misses
- May need a longer hit time and/or higher hardware cost
- Popular in off-chip caches

Higher Associativity
- 8-way set associative is, for practical purposes, as effective in reducing misses as fully associative
- 2:1 rule of thumb: a 2-way set-associative cache of size N/2 misses about as often as a direct-mapped cache of size N (holds for cache sizes < 128 KB)
- Greater associativity comes at the cost of increased hit time
  • It lengthens the clock cycle; Hill [1988] suggested the hit time for 2-way vs. 1-way is +10% for an external cache and +2% for an internal one

Effect of Higher Associativity on AMAT
- Clock-cycle-time (2-way) = 1.10 * Clock-cycle-time (1-way)
- Clock-cycle-time (4-way) = 1.12 * Clock-cycle-time (1-way)
- Clock-cycle-time (8-way) = 1.14 * Clock-cycle-time (1-way)

Way Prediction
- Extra bits are kept in the cache to predict the way (block within the set) of the next cache access
- The multiplexor is set early to select the desired block, and only a single tag comparison is performed that clock cycle
- A miss results in checking the other blocks for matches in subsequent clock cycles
- The Alpha 21264 uses way prediction in its 2-way set-associative instruction cache; simulation using SPEC95 suggested a prediction accuracy in excess of 85%

Compiler Optimization for Code
- Code can easily be rearranged without affecting correctness
- Reordering the procedures of a program might reduce instruction miss rates by reducing conflict misses
- McFarling's observation using profiling information [1988]:
  • Misses reduced by 50% for a 2KB direct-mapped instruction cache with 4-byte blocks, and by 75% in an 8KB cache
  • Optimized programs on a direct-mapped cache missed less than unoptimized ones on an 8-way set-associative cache of the same size

Compiler Optimization for Data
- Idea: improve the spatial and temporal locality of the data
- There are lots of options:
  • Array merging: allocate arrays so that paired operands show up in the same cache block
  • Loop interchange: exchange the inner and outer loop order to improve cache behavior
  • Loop fusion: for independent loops accessing the same data, fuse them into a single aggregate loop
  • Blocking: do as much as possible on a sub-block before moving on (a sketch follows the examples below)

Merging Arrays Example

/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val and key; improves spatial locality.

Loop Interchange Example

/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improves spatial locality.

Loop Fusion Example

/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1)
        d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        a[i][j] = 1/b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

The two loops perform different computations on the same data, so fuse them: 2 misses per access to a and c become one miss per access; improves temporal locality.
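Blocking Example (sketch)

The list above mentions blocking but shows no example. Below is the classic blocked matrix multiply commonly used to illustrate it (not taken from the slides); the blocking factor B is an assumption, chosen so that a B x B sub-block of each matrix fits in the cache. The result array x is assumed to be zero-initialized by the caller.

#define N 512
#define B 32   /* assumed blocking factor */

void matmul_blocked(double x[N][N], double y[N][N], double z[N][N])
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j]; /* reuse the y and z sub-blocks */
                    x[i][j] += r;               /* accumulate partial products  */
                }
}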
5.7 Reducing Hit Time

Reducing Hit Time
- Hit time is critical because it affects the clock cycle time
  • On many machines, cache access time limits the clock cycle rate
- A fast hit time is multiplied in importance beyond the average memory access time formula because it helps everything
  • Average-Memory-Access-Time = Hit-Access-Time + Miss-Rate * Miss-Penalty
  • The miss penalty is clock-cycle dependent

Techniques for Reducing Hit Time
- Small and simple caches
- Avoiding address translation during indexing of the cache
- Pipelined cache access
- Trace caches

Small and Simple Caches
- A time-consuming portion of a cache hit is using the index portion of the address to read the tag and then compare it to the address
- Small caches - smaller hardware is faster
  • Keep the L1 cache small enough to fit on the same chip as the CPU
  • For L2 caches, keep the tags on-chip and the data off-chip
- Simple caches - direct-mapped
  • This trades hit time for an increased miss rate: a small direct-mapped cache misses more often than a small associative cache, but the simpler structure makes the hit go faster

Pipelining Writes for Fast Write Hits - Pipelined Cache
- Write hits usually take longer than read hits because the tag must be checked before writing the data
- Pipeline the write: 2 stages - tag check and cache update (can be more in practice)
  • The current write's tag check overlaps the previous write's cache update
- Result
  • It looks as if a write happens on every cycle
  • The cycle time can stay short since the real write is spread over two stages
  • This mostly works if the CPU is not dependent on data from a write - spot any problems if read and write ordering is not preserved by the memory system?
- Reads play no part in this pipeline since they already operate in parallel with the tag check

Cache Optimization Summary (table)

5.8 Main Memory

Main Memory - 3 Important Issues
- Capacity
- Latency
  • Access time: the time between a read request and the arrival of the word
  • Cycle time: the minimum time between requests to memory (> access time), since the memory needs the address lines to be stable between accesses
  • Latency is amortized by addressing big chunks, like an entire cache block
  • Critical to cache performance when the miss goes to main memory
- Bandwidth: the number of bytes read or written per unit time
  • Affects the time it takes to transfer the block

Example of Memory Latency and Bandwidth
- Consider 4 cycles to send the address, 56 cycles per word of access, and 4 cycles to transmit the data
- Hence, if main memory is organized by word, 64 cycles must be spent for every word we want to access
- Given a cache line of 4 words (8 bytes per word), the miss penalty is 256 cycles
- Memory bandwidth = 1/8 byte per clock cycle (4 * 8 / 256)

Improving Main Memory Performance
- Simple: CPU, cache, bus, and memory all have the same width (32 or 64 bits)
- Wide: CPU/mux is 1 word; mux/cache, bus, and memory are N words (Alpha: 64 bits and 256 bits; UltraSPARC: 512 bits)
- Interleaved: CPU, cache, and bus are 1 word; memory has N modules (e.g., 4 modules); the example is word interleaving

3 Examples of Bus Width, Memory Width, and Memory Interleaving to Achieve Memory Bandwidth (figure)

Wider Main Memory
- Doubling or quadrupling the width of the cache or memory doubles or quadruples the memory bandwidth
  • The miss penalty is reduced correspondingly
- Cost and drawbacks
  • More cost on the memory bus
  • The multiplexer between the cache and the CPU may be on the critical path (the CPU still accesses the cache one word at a time); multiplexors can instead be placed between L1 and L2
  • The design of error correction becomes more complicated: if only a portion of the block is updated, all other portions must be read to calculate the new error correction code
  • Since main memory is traditionally expandable by the customer, the minimum increment is doubled or quadrupled
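The miss penalties for the three organizations follow directly from the parameters above. The 256-cycle and 76-cycle figures appear in the slides; the 4-word-wide case assumes the bus and memory are widened to a full block, and the code is only an illustrative calculator.

#include <stdio.h>

int main(void)
{
    const int addr = 4, access = 56, xfer = 4, words = 4;

    int narrow      = words * (addr + access + xfer);  /* 1-word-wide memory */
    int wide        = addr + access + xfer;            /* 4-word-wide memory */
    int interleaved = addr + access + words * xfer;    /* 4 one-word banks   */

    printf("1-word wide: %d cycles\n", narrow);        /* 256 */
    printf("4-word wide: %d cycles\n", wide);          /*  64 */
    printf("interleaved: %d cycles\n", interleaved);   /*  76 */
    return 0;
}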
Simple Interleaved Memory
- Memory chips are organized into banks to read or write multiple words at a time, rather than a single word
  • The banks share address lines with a memory controller
  • Keep the memory bus the same width but make it run faster
  • Takes advantage of the potential memory bandwidth of all the DRAM banks
- The banks are often one word wide - good for accessing consecutive memory locations
  • Miss penalty of 4 + 56 + 4*4, or 76 CC (about 0.4 bytes per CC)
- Bank_# = Address MOD #_of_banks
- Address_within_bank = Floor(Address / #_of_banks)
- Interleaving factor = #_of_banks (usually a power of 2)
- Organization of a four-way interleaved memory (figure)

What Can Interleaving and a Wide Memory Buy?
- Block size = 1, 2, or 4 words, with miss rates of 3%, 2%, and 1.2% respectively
- Memory bus width = 1 word; memory accesses per instruction = 1.2
- Cache miss penalty = 64 cycles (as above); average cycles per instruction (ignoring cache misses) = 2
- Block size = 1 word
  • CPI = 2 + (1.2 * 3% * 64) = 4.3
- Block size = 2 words
  • 64-bit bus and memory, no interleaving = 2 + (1.2 * 2% * 2 * 64) = 5.07
  • 64-bit bus and memory, interleaving = 2 + (1.2 * 2% * (4+56+2*4)) = 3.63
  • 128-bit bus and memory, no interleaving = 2 + (1.2 * 2% * 1 * 64) = 3.54
- Block size = 4 words
  • 64-bit bus and memory, no interleaving = 2 + (1.2 * 1.2% * 4 * 64) = 5.69
  • 64-bit bus and memory, interleaving = 2 + (1.2 * 1.2% * (4+56+4*4)) = 3.09
  • 128-bit bus and memory, no interleaving = 2 + (1.2 * 1.2% * 2 * 64) = 3.84

Simple Interleaved Memory (Cont.)
- Interleaved memory is logically a wide memory, except that accesses to the banks are staged over time to share the bus
- How many banks should be included?
  • More than the number of clock cycles to access a word in a bank, so that a new bank can deliver a word each clock for sequential accesses and waiting is avoided
- Disadvantages
  • Multiple banks become expensive as chips grow larger and systems use fewer chips: 512MB of RAM built from 256 chips of 4M*4 bits is easily organized into 16 banks of 16 chips, but with 16 chips of 64M*4 bits there is only 1 bank
  • More difficulty in main memory expansion (as with wider memory)
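The interleaving formulas above map directly to code; four one-word banks are assumed, as in the four-way interleaved example, and the sketch is only illustrative.

#include <stdio.h>

#define NUM_BANKS 4

int main(void)
{
    /* Consecutive word addresses map round-robin across the banks. */
    for (unsigned addr = 0; addr < 8; addr++) {
        unsigned bank         = addr % NUM_BANKS;  /* Bank_# = Address MOD #_of_banks   */
        unsigned addr_in_bank = addr / NUM_BANKS;  /* Floor(Address / #_of_banks)       */
        printf("word address %u -> bank %u, offset %u\n", addr, bank, addr_in_bank);
    }
    return 0;
}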
5.9 Memory Technology

DRAM Technology
- Semiconductor Dynamic Random Access Memory
- Emphasis is on cost per bit and capacity
- Address lines are multiplexed, cutting the number of address pins in half
  • Row access strobe (RAS) first, then column access strobe (CAS)
  • The memory is a 2D matrix - a whole row goes into a buffer, and subsequent CAS accesses select a sub-row
- Only a single transistor is used to store a bit
  • Reading that bit can destroy the information
  • Refresh each bit periodically (e.g., every 8 milliseconds) by writing it back
  • Keep the refresh time to less than 5% of the total time
- DRAM capacity is 4 to 8 times that of SRAM

DRAM Technology (Cont.)
- DIMM: dual inline memory module
  • DRAM chips are commonly sold on small boards called DIMMs
  • DIMMs typically contain 4 to 16 DRAMs
- DRAM capacity growth is slowing
  • Four times the capacity every three years, for more than 20 years
  • Since 1998, new chips only double capacity every two years
- DRAM performance is growing at a slower rate
  • RAS (related to latency): 5% per year
  • CAS (related to bandwidth): 10%+ per year

RAS Improvement
- A performance improvement in RAS of about 5% per year (figure)

SRAM Technology
- Caches use SRAM: Static Random Access Memory
- SRAM uses six transistors per bit to prevent the information from being disturbed when read, so no refresh is needed
- SRAM needs only minimal power to retain the charge in standby mode - good for embedded applications
- There is no difference between access time and cycle time for SRAM
- Emphasis is on speed and capacity
  • SRAM address lines are not multiplexed
- SRAM speed is 8 to 16 times that of DRAM

ROM and Flash
- Embedded processor memory
- Read-only memory (ROM)
  • Programmed at the time of manufacture
  • Only a single transistor per bit to represent a 1 or 0
  • Used for the embedded program and for constants
  • Nonvolatile and indestructible
- Flash memory: nonvolatile, but allows the memory to be modified
  • Reads at almost DRAM speeds, but writes are 10 to 100 times slower
  • DRAM capacity per chip and megabytes per dollar are about 4 to 8 times those of flash

Improving Memory Performance in a Standard DRAM Chip
- Fast page mode: timing signals allow repeated accesses to the row buffer without another row access time
- Synchronous DRAM (SDRAM): adds a clock signal to the DRAM interface, so that repeated transfers do not bear the overhead of synchronizing with the controller
  • Asynchronous DRAM involves overhead to sync with the controller
  • Peak speed per memory module of 800-1200 MB/sec in 2001
- Double data rate (DDR): transfers data on both the rising and the falling edge of the DRAM clock signal
  • Peak speed per memory module of 1600-2400 MB/sec in 2001

5.10 Virtual Memory

Virtual Memory
- Virtual memory divides physical memory into blocks (called pages or segments) and allocates them to different processes
- With virtual memory, the CPU produces virtual addresses that are translated by a combination of hardware and software into physical addresses, which access main memory. This process is called memory mapping or address translation
- Today, the two memory-hierarchy levels controlled by virtual memory are DRAM and magnetic disk

Example of Virtual to Physical Address Mapping
- Mapping by a page table (figure)

Address Translation Hardware for Paging
- The physical address is formed from a frame number f (l-n bits) and a frame offset d (n bits) (figure)

Page Table When Some Pages Are Not in Main Memory
- (figure; an access to an unmapped page is an illegal access)

Virtual Memory (Cont.)
- Permits applications to grow bigger than the main memory size
- Helps with multiple-process management
  • Each process gets its own chunk of memory
  • Permits protection of one process's chunks from another
  • Maps multiple chunks onto shared physical memory
  • Mapping also facilitates relocation (a program can run in any memory location and can be moved during execution)
  • The application and CPU run in virtual space (logical memory, 0 to max); the mapping onto physical space is invisible to the application
- Cache vs. VM terminology
  • A block becomes a page or segment
  • A miss becomes a page or address fault

Typical Page Parameters (table)
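A sketch of the paging translation described above: the virtual address is split into a virtual page number and an offset d of n bits, the page table supplies the frame number f, and the physical address is f concatenated with d. The 4 KB page size and the flat single-level page table are assumptions for illustration.

#include <stdint.h>
#include <stdbool.h>

#define PAGE_OFFSET_BITS 12            /* n: 4 KB pages (assumed)         */
#define NUM_PAGES        1024          /* size of this toy address space  */

typedef struct {
    bool     present;                  /* page currently in main memory?  */
    uint32_t frame;                    /* physical frame number f         */
} pte_t;

static pte_t page_table[NUM_PAGES];

/* Returns true and fills *phys on success; false means a page fault
 * (or an illegal access if the page number is out of range). */
bool translate(uint32_t vaddr, uint32_t *phys)
{
    uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);

    if (vpn >= NUM_PAGES || !page_table[vpn].present)
        return false;                             /* the OS must handle it */

    *phys = (page_table[vpn].frame << PAGE_OFFSET_BITS) | offset;
    return true;
}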
Cache vs. VM Differences
- Replacement
  • A cache miss is handled by hardware
  • A page fault is usually handled by the OS
- Addresses
  • The VM space is determined by the address size of the CPU
  • The cache size is independent of the CPU address size
- Lower-level memory
  • For caches, the main memory is not shared by something else
  • For VM, most of the disk contains the file system (addressed differently, usually in I/O space); the VM lower level is usually called swap space

2 VM Styles - Paged or Segmented?
- Virtual memory systems can be categorized into two classes: pages (fixed-size blocks) and segments (variable-size blocks)
  • Words per address: page - one; segment - two (segment and offset)
  • Programmer visible? page - invisible to the application programmer; segment - may be visible to the application programmer
  • Replacing a block: page - trivial (all blocks are the same size); segment - hard (must find a contiguous, variable-size, unused portion of main memory)
  • Memory use inefficiency: page - internal fragmentation (unused portion of a page); segment - external fragmentation (unused pieces of main memory)
  • Efficient disk traffic: page - yes (adjust the page size to balance access time and transfer time); segment - not always (small segments may transfer just a few bytes)

Virtual Memory - The Same 4 Questions
- Block placement: the choice is lower miss rates with complex placement, or vice versa
  • The miss penalty is huge, so choose a low miss rate and place pages anywhere - similar to a fully associative cache
- Block identification: both styles use an additional data structure
  • Fixed-size pages use a page table; variable-size segments use a segment table
  • (figure: address translation hardware for paging - frame number f of l-n bits plus offset d of n bits)

Virtual Memory - The Same 4 Questions (Cont.)
- Block replacement: LRU is the best
  • However, true LRU is a bit complex, so an approximation is used
  • The page table contains a use bit, set on each access
  • The OS checks the use bits every so often, records what it sees in a data structure, and then clears them all
  • On a miss, the OS decides which page has been used the least and replaces it
- Write strategy: always write back
  • Given the access time of the disk, write through is silly
  • Use a dirty bit so that only pages that have been modified are written back

Techniques for Fast Address Translation
- The page table is kept in main memory (kernel memory); each process has a page table
- Every data/instruction access then requires two memory accesses: one for the page table and one for the data/instruction
- This can be solved with a special fast-lookup hardware cache called associative registers or a translation look-aside buffer (TLB)
  • If locality applies, cache the recent translations
- A TLB entry holds: virtual page number, physical page number, protection bits, use bit, dirty bit

TLB = Translation Look-aside Buffer
- The TLB must be on chip; otherwise it is worthless
- Fully associative - parallel search
- Typical TLBs
  • Hit time: 1 cycle
  • Miss penalty: 10 to 30 cycles
  • Miss rate: 0.1% to 2%
  • TLB size: 32 B to 8 KB

Paging Hardware with TLB (figure)

TLB of the Alpha 21264
- Address Space Number: a process ID that avoids flushing the TLB on a context switch
- A total of 128 TLB entries

Page Size - An Architectural Choice
- Large pages are good:
  • They reduce the page table size
  • They amortize the long disk access
  • If spatial locality is good, the hit rate will improve
  • They reduce the number of TLB misses
- Large pages are bad:
  • More internal fragmentation (if everything is random, each structure's last page is only half full)
  • Process start-up time takes longer
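A minimal sketch of the TLB lookup described above: a small, fully associative buffer of recent translations, with a page-table walk and a TLB fill on a miss. The entry count, field widths, fill policy, and the page_table_walk helper are illustrative assumptions (the Alpha 21264 example above uses 128 entries).

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 16

typedef struct {
    bool     valid;
    uint32_t vpn;       /* virtual page number          */
    uint32_t frame;     /* physical page (frame) number */
    bool     dirty;     /* set when the page is written */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Hypothetical helper: walks the in-memory page table; false = page fault. */
extern bool page_table_walk(uint32_t vpn, uint32_t *frame);

bool tlb_translate(uint32_t vpn, uint32_t *frame)
{
    /* Hardware searches all entries in parallel; software models it as a loop. */
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *frame = tlb[i].frame;            /* TLB hit: about 1 cycle          */
            return true;
        }
    }
    if (!page_table_walk(vpn, frame))         /* TLB miss: 10-30 cycles, or a    */
        return false;                         /* page fault handled by the OS    */

    tlb[0].valid = true;                      /* fill a fixed slot for simplicity; */
    tlb[0].vpn   = vpn;                       /* real hardware would use LRU or   */
    tlb[0].frame = *frame;                    /* random replacement               */
    tlb[0].dirty = false;
    return true;
}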