Parallel Scientific Computing: Algorithms and Tools
Lecture #2, APMA 2821A, Spring 2008
Instructors: George Em Karniadakis, Leopold Grinberg

Memory

Bits: 0, 1; byte: 8 bits.
Memory sizes: PB = 10^15 bytes; TB = 10^12 bytes; GB = 10^9 bytes; MB = 10^6 bytes; KB = 10^3 bytes.

Memory performance measures:
- Access time (response time, latency): the interval between the time a memory request is issued and the time the request is satisfied.
- Cycle time: the minimum time between two successive memory requests.

Timeline (figure): a request is issued at t0 and satisfied at t1; the memory cannot accept another request until t2.
- Access time = t1 - t0
- Cycle time = t2 - t0
- The memory is busy for t0 < t < t2; a request arriving in that interval must wait until t > t2. (The gap between t1 and t2 is characteristic of DRAM.)

Memory Hierarchy

Memory can be fast (costly) or slow (cheaper). To increase overall performance, exploit locality of reference:
- Place faster (and smaller) memory closer to the CPU, and slower (and larger) memory farther away.
- Keep often-used data in fast memory; leave less-often-used data in slow memory.
- Key idea: when a lower level of the hierarchy sends the value at location x to a higher level, it also sends the contents at x+1, x+2, etc., i.e. a block of data, called a cache line.

Memory Hierarchy (levels)

A cache is a piece of fast (and expensive) memory; think "CA$H". Going from the CPU outward, speed and cost per byte decrease while capacity increases:
  Registers
  Level-1 (L1) cache
  Level-2 (L2) cache
  Main memory
  Secondary memory (hard disk)
  Network storage ...

Performance at different levels can differ greatly: the access time of the L1 cache may be 1 cycle and that of the L2 cache 5 or 6 cycles, while main memory may take dozens of cycles and secondary memory is orders of magnitude slower still.

How the Memory Hierarchy Works (RISC processor)

- The CPU works only on data in registers. If data is not in a register, it requests the data from memory and loads it into a register.
- Data in registers comes only from, and goes only to, the L1 cache.
- When the CPU requests data from memory, the L1 cache takes over. If the data is in the L1 cache (cache hit), it is returned to the CPU immediately and the memory access ends. If the data is not in the L1 cache (cache miss):
- The L1 cache forwards the request down to the L2 cache. On an L2 hit, the L2 cache returns the data to the L1 cache, which in turn returns it to the CPU, ending the access. On an L2 miss:
- The L2 cache forwards the request down to main memory. If the data is in main memory, it is passed to the L2 cache, then to the L1 cache, then to the CPU.
- If the data is not in main memory, the request is passed to the OS, which reads the data from secondary storage (disk); the data then travels through main memory, the L2 cache, and the L1 cache into a register.

Cache Line

- A cache line is the smallest unit of data that can be transferred to or from memory (and the L2 cache).
- It is usually between 32 and 128 bytes long and may contain several data items.
- When the L2 cache passes data to the L1 cache, or when main memory passes data to the L2 cache, a whole cache line is transferred rather than a single piece of data.
- When the data in a variable X is requested from memory, the cache line containing X (and adjacent data) is transferred to the cache. For example, with a 32-byte cache line and 8-byte elements, if X[11] is requested by the CPU, the whole line X[10]-X[13] is brought into the cache from memory (figure: ... X[9] | X[10] X[11] X[12] X[13] | X[14] ..., with "|" marking cache-line boundaries).
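The different speeds of the hierarchy levels described above can be observed directly by timing dependent loads over working sets of increasing size: once the working set no longer fits in the L1 (then L2) cache, the average access time jumps. The program below is a minimal sketch, not part of the lecture; it assumes a POSIX system (clock_gettime, random), and the sizes, the pointer-chasing trick, and the output format are illustrative choices.

  /* Minimal sketch: estimate average memory access latency for
   * increasing working-set sizes. The latency jumps when the working
   * set spills out of L1, then L2, then (if large enough) main memory. */
  #define _XOPEN_SOURCE 700
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  static double now_sec(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + 1e-9 * ts.tv_nsec;
  }

  int main(void) {
      /* Working sets from 4 KB to 64 MB, doubling each time. */
      for (size_t bytes = 1u << 12; bytes <= 1u << 26; bytes *= 2) {
          size_t n = bytes / sizeof(size_t);
          size_t *next = malloc(n * sizeof(size_t));
          if (next == NULL) return 1;

          /* Sattolo's algorithm: build one random cycle through all n
           * slots, so the chase below visits the whole working set.   */
          for (size_t i = 0; i < n; i++) next[i] = i;
          for (size_t i = n - 1; i > 0; i--) {
              size_t j = (size_t)random() % i;
              size_t t = next[i]; next[i] = next[j]; next[j] = t;
          }

          /* Dependent loads: each address comes from the previous load,
           * so the latency cannot be hidden by out-of-order execution. */
          size_t steps = 1u << 24, k = 0;
          double t0 = now_sec();
          for (size_t s = 0; s < steps; s++) k = next[k];
          double t1 = now_sec();

          printf("%8zu KB : %6.2f ns per access   (checksum %zu)\n",
                 bytes >> 10, 1e9 * (t1 - t0) / (double)steps, k);
          free(next);
      }
      return 0;
  }

Compile with optimization (e.g. cc -O2; very old glibc versions may also need -lrt). The checksum is printed only so the compiler cannot optimize the chase away.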
Cache Effect on Performance

- A cache miss degrades performance: the CPU sits idle while another cache line is brought in from a lower level of the memory hierarchy.
- A higher cache hit rate means higher performance; efficiency is directly related to the reuse of data in the cache.
- To increase the cache hit rate, access memory sequentially; avoid strides, random access, and indirect addressing in programming:

  for (i = 0; i < 100; i++)     y[i] = 2*x[i];          /* sequential access   */
  for (i = 0; i < 100; i += 4)  y[i] = 2*x[i];          /* strided access      */
  for (i = 0; i < 100; i++)     y[i] = 2*x[index[i]];   /* indirect addressing */

Where in the Cache to Put Data from Memory

- The cache is organized into cache lines, and memory is also logically organized into cache lines.
- Memory size >> cache size, so the number of cache lines in memory is much larger than the number of cache lines in the cache; many memory cache lines correspond to one cache line in the cache.
- Example (32-byte cache line): a 1 MB cache holds 32,768 cache lines, while 2 GB of main memory contains 67,108,864 cache lines.

Cache Classification

- Direct-mapped cache: a given memory cache line is always placed in one specific cache line in the cache.
- Fully associative cache: a given memory cache line can be placed in any of the cache lines in the cache.
- N-way set associative cache: a given memory cache line can be placed in any one of a set of N cache lines in the cache.

Direct-Mapped Cache

- A set of memory cache lines always corresponds to exactly the same cache line in the cache:
    line index in cache = mod(memory cache-line index, total cache lines in cache)
  (Figure: with an 8 KB direct-mapped cache, the memory lines starting at addresses 0, 8K, 16K, ..., up to 2G all map to the same location in the cache.)
- Cheap to implement in hardware, but may cause cache thrashing: repeatedly displacing and loading the same cache lines.

Cache Thrashing: Example

Assumptions: direct-mapped cache; cache size 1 MB; cache line 32 bytes. One double value is 8 bytes, so 131072 doubles occupy 1 MB and one cache line holds 4 doubles; X[131072] and Y[131072] each occupy 1 MB of memory.

  double X[131072], Y[131072];
  long i, j;
  /* initialization of X, Y */
  ...
  for (i = 0; i < 131072; i++)
      Y[i] = X[i] + Y[i];
  ...

Because each array is exactly 1 MB (the size of the cache), X[i] and Y[i] map to the same cache line, and both the cache and each array hold 32,768 lines:
- i=0: load line X[0]-X[3] into cache; load X[0] from cache to a register; load line Y[0]-Y[3] into cache, displacing X[0]-X[3]; load Y[0] from cache into a register; add, update Y[0] in cache.
- i=1: load X[0]-X[3] into cache, displacing Y[0]-Y[3] and writing that line back to memory; load X[1] from cache to a register; load Y[0]-Y[3] into cache, displacing X[0]-X[3]; load Y[1] from cache to a register; add, update Y[1] in cache.
- i=2: load X[0]-X[3] into cache, displacing Y[0]-Y[3] and writing it back to memory; load X[2] from cache to a register; load Y[0]-Y[3] into cache, displacing X[0]-X[3]; load Y[2] from cache to a register; add, update Y[2] in cache.
- i=3: ... and so on.

There is no cache reuse, and performance is poor. Avoid cache thrashing!

Fully Associative Cache

- A cache line from memory can be placed anywhere in the cache, so there is no cache thrashing, but the hardware is costly.
- Direct-mapped and fully associative caches are the two extremes of a spectrum.
- Disadvantage: the entire cache must be searched to determine whether a specific cache line is present.
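As a small illustration of the direct-mapped placement rule, and of why X[i] and Y[i] collide in the thrashing example above, the sketch below computes the cache line that a few addresses map to. It is a minimal model, not part of the lecture: it assumes a 1 MB direct-mapped cache with 32-byte lines and assumes X and Y happen to be laid out back to back in memory (real layouts depend on the compiler and OS); the helper cache_line_of is defined here only for the example.

  /* Minimal model of direct-mapped placement: a memory line with index m
   * maps to cache line (m mod number-of-lines-in-cache).                */
  #include <stdio.h>

  #define LINE_BYTES   32u
  #define CACHE_BYTES  (1u << 20)                    /* 1 MB          */
  #define CACHE_LINES  (CACHE_BYTES / LINE_BYTES)    /* 32,768 lines  */

  /* Cache line (in the cache) that a given byte address maps to. */
  static unsigned cache_line_of(unsigned long addr) {
      unsigned long mem_line = addr / LINE_BYTES;    /* line index in memory */
      return (unsigned)(mem_line % CACHE_LINES);     /* direct-mapped rule   */
  }

  int main(void) {
      /* Hypothetical layout: X starts at address 0, Y right after X. */
      unsigned long X_base = 0;
      unsigned long Y_base = 131072ul * sizeof(double);   /* = 1 MB */

      for (long i = 0; i <= 16; i += 4) {
          unsigned long x_addr = X_base + i * sizeof(double);
          unsigned long y_addr = Y_base + i * sizeof(double);
          printf("X[%2ld] -> cache line %u,  Y[%2ld] -> cache line %u\n",
                 i, cache_line_of(x_addr), i, cache_line_of(y_addr));
      }
      /* Because the arrays are exactly one cache size (1 MB) apart, X[i]
       * and Y[i] always map to the same line: each access evicts the other. */
      return 0;
  }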
N-Way Set Associative Cache

- A compromise between a direct-mapped cache and a fully associative cache.
- The cache lines in the cache are divided into a number of sets, each containing N cache lines.
- Given a cache line from memory, the index of the set it belongs to is computed first; the line is then placed in one of the N cache lines of that set.
- Example (figure): a 2-way set associative 1 MB cache has 32,768 cache lines organized as 16,384 sets of 2 lines each, serving 2 GB of main memory (67,108,864 cache lines).
- A direct-mapped cache is a 1-way set associative cache; a fully associative cache is an N_c-way set associative cache, where N_c is the total number of cache lines in the cache.
- Less likely to cause cache thrashing than a direct-mapped cache, and less costly than a fully associative one.

Instruction/Data Cache

- A CPU may have separate instruction and data caches (split cache).
- A CPU may have a single cache for both instructions and data from memory (unified cache).

Remember ...

- Efficiency is directly related to cache reuse.
- Cache thrashing can be eliminated by padding arrays so that array dimensions are not large powers of two, which would make corresponding elements of different arrays map to the same cache lines; in short, avoid powers of 2.
- To improve cache reuse: access memory sequentially as much as possible; avoid strides, random access, and indirect addressing; avoid cache thrashing.

Example

  double X[1024][1024], Y[1024][1024];
  int i, j;
  ...
  for (j = 0; j < 1024; j++)
      for (i = 0; i < 1024; i++)
          X[i][j] = Y[i][j];

C arrays are stored in row-major order (X[0][0], X[0][1], ..., X[0][1023], X[1][0], ..., X[1023][1023], and likewise for Y), so with this loop ordering the inner loop accesses memory with a stride of 1024 doubles, i.e. 8 KB. A large stride in the memory access pattern results not only in cache misses and poor cache reuse, but also in TLB misses. (A corrected version of this loop appears in the sketch at the end of this section.)

Virtual Memory, Memory Paging

- Modern computers use virtual memory: the memory address seen by a program (the virtual address) is not the actual address in physical memory.
- Memory is divided into pages (e.g. 4 KB); a page in a program's address space corresponds to a page in physical memory.
- (Figure: the virtual address spaces of two programs, one of 4 GB and one of 2 GB, are divided into pages that are mapped onto pages of physical memory.)
- To access memory, the program's virtual address must be translated to the actual address in physical memory; this is done using a page table.

Translation Look-aside Buffer (TLB)

- The TLB is a special cache for the page table, giving faster access to virtual-to-physical address translations.
- When a program accesses a memory location, the translation between the virtual and physical page is loaded into the TLB (if it is not already there).
- If the program exhibits locality of reference, TLB entries can be reused (TLB hits), giving better performance; otherwise TLB misses degrade performance.
- A large stride in the memory access pattern causes TLB misses (as well as cache misses).

Remedies

- Use a large memory page size; on some systems (e.g. IBM SP, HP machines) the page size can be modified by user programs.
- Avoid large strides in memory access; access memory sequentially as much as possible.

Interleaved Memory

- Memory interleaving alleviates the impact of memory cycle time.
- The total memory is divided into a set of memory banks, with contiguous memory addresses residing on different banks.
- When memory is accessed sequentially, the effect of the cycle time is minimized: while the current bank is busy, the next bank is idle and can be accessed immediately.
- Strided memory access is unfavorable: it may hit the same bank repeatedly, forcing waits for the cycle time and degrading performance.
- Example (figure): 2 GB of memory divided into 4 banks of 512 MB each, with a 32-byte cache line; lines are assigned to banks in round-robin order, so bytes 0-31 go to bank 1, bytes 32-63 to bank 2, bytes 64-95 to bank 3, bytes 96-127 to bank 4, bytes 128-159 back to bank 1, and so on.
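As a small illustration of the interleaving rule in the figure above, the sketch below computes which bank each 32-byte line falls in, first for sequential access and then for a stride that revisits the same bank. The 4-bank, 32-byte-line configuration matches the figure; the helper bank_of and the chosen stride are illustrative only.

  /* Minimal sketch of cache-line interleaving across memory banks,
   * matching the 4-bank / 32-byte-line configuration above.        */
  #include <stdio.h>

  #define LINE_BYTES 32u
  #define NUM_BANKS  4u

  /* Bank holding a given byte address: consecutive 32-byte lines are
   * assigned to banks 1, 2, 3, 4, 1, 2, ... in round-robin order.   */
  static unsigned bank_of(unsigned long addr) {
      return (unsigned)((addr / LINE_BYTES) % NUM_BANKS) + 1;
  }

  int main(void) {
      printf("Sequential access, one line at a time:\n");
      for (unsigned long addr = 0; addr < 8 * LINE_BYTES; addr += LINE_BYTES)
          printf("  bytes %4lu-%4lu -> bank %u\n",
                 addr, addr + LINE_BYTES - 1, bank_of(addr));

      /* A stride of 4 lines (128 bytes) hits the same bank every time,
       * so each access must wait out that bank's cycle time.          */
      printf("Strided access, stride = 128 bytes:\n");
      for (unsigned long addr = 0; addr < 8 * 128; addr += 128)
          printf("  byte %5lu -> bank %u\n", addr, bank_of(addr));
      return 0;
  }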
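Finally, returning to the 1024x1024 copy example and the padding remedy discussed earlier: the sketch below shows the loop order interchanged so that the innermost index walks contiguously through memory, together with an array declared with a padded leading dimension so that rows no longer start at power-of-two offsets. The pad of one extra column is an illustrative choice, not a prescription from the lecture.

  /* Minimal sketch of two remedies: loop interchange (unit-stride inner
   * loop) and array padding (avoid power-of-two leading dimensions).   */
  #define N   1024
  #define PAD 1            /* extra column so a row is no longer exactly 8 KB */

  static double X[N][N + PAD], Y[N][N + PAD];

  static void copy_rowwise(void) {
      /* i outermost, j innermost: consecutive inner-loop iterations touch
       * consecutive memory locations (C arrays are row-major), so each
       * cache line brought in is fully used before it is evicted.       */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              X[i][j] = Y[i][j];
  }

  int main(void) {
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              Y[i][j] = i + 0.001 * j;
      copy_rowwise();
      return (X[N-1][N-1] == Y[N-1][N-1]) ? 0 : 1;
  }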