Parallel Scientific Computing: Algorithms and Tools
Lecture #2, APMA 2821A, Spring 2008
Instructors: George Em Karniadakis, Leopold Grinberg

Memory

Bits: 0, 1; byte: 8 bits.
Memory sizes: PB = 10^15 bytes; TB = 10^12 bytes; GB = 10^9 bytes; MB = 10^6 bytes; KB = 10^3 bytes.

Memory performance measures:
- Access time (response time, latency): the interval between the time a memory request is issued and the time the request is satisfied.
- Cycle time: the minimum time between two successive memory requests.

Timeline (figure): a request is issued at t0 and satisfied at t1; the memory cannot accept another request until t2.
- Access time = t1 - t0
- Cycle time = t2 - t0
- The memory is busy for t0 < t < t2; a request arriving in that interval must wait until t > t2. (The gap between t1 and t2 is characteristic of DRAM.)

Memory Hierarchy

Memory can be fast (costly) or slow (cheaper). To increase overall performance, exploit locality of reference:
- Place faster (and smaller) memory closer to the CPU, and slower (and larger) memory farther away.
- Keep often-used data in fast memory; leave less-often-used data in slow memory.
- Key idea: when a lower level of the hierarchy sends the value at location x to a higher level, it also sends the contents at x+1, x+2, etc., i.e. a block of data, called a cache line.

Memory Hierarchy (levels)

A cache is a piece of fast (and expensive) memory; think "CA$H". Going from the CPU outward, speed and cost per byte decrease while capacity increases:
  Registers
  Level-1 (L1) cache
  Level-2 (L2) cache
  Main memory
  Secondary memory (hard disk)
  Network storage ...

Performance at different levels can differ greatly: the access time of the L1 cache may be 1 cycle and that of the L2 cache 5 or 6 cycles, while main memory may take dozens of cycles and secondary memory is orders of magnitude slower still.

How the Memory Hierarchy Works (RISC processor)

- The CPU works only on data in registers. If data is not in a register, it requests the data from memory and loads it into a register.
- Data in registers comes only from, and goes only to, the L1 cache.
- When the CPU requests data from memory, the L1 cache takes over. If the data is in the L1 cache (cache hit), it is returned to the CPU immediately and the memory access ends. If the data is not in the L1 cache (cache miss):
- The L1 cache forwards the request down to the L2 cache. On an L2 hit, the L2 cache returns the data to the L1 cache, which in turn returns it to the CPU, ending the access. On an L2 miss:
- The L2 cache forwards the request down to main memory. If the data is in main memory, it is passed to the L2 cache, then to the L1 cache, then to the CPU.
- If the data is not in main memory, the request is passed to the OS, which reads the data from secondary storage (disk); the data then travels through main memory, the L2 cache, and the L1 cache into a register.

Cache Line

- A cache line is the smallest unit of data that can be transferred to or from memory (and the L2 cache).
- It is usually between 32 and 128 bytes long and may contain several data items.
- When the L2 cache passes data to the L1 cache, or when main memory passes data to the L2 cache, a whole cache line is transferred rather than a single piece of data.
- When the data in a variable X is requested from memory, the cache line containing X (and adjacent data) is transferred to the cache. For example, with a 32-byte cache line and 8-byte elements, if X[11] is requested by the CPU, the whole line X[10]-X[13] is brought into the cache from memory (figure: ... X[9] | X[10] X[11] X[12] X[13] | X[14] ..., with "|" marking cache-line boundaries).
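The different speeds of the hierarchy levels described above can be observed directly by timing dependent loads over working sets of increasing size: once the working set no longer fits in the L1 (then L2) cache, the average access time jumps. The program below is a minimal sketch, not part of the lecture; it assumes a POSIX system (clock_gettime, random), and the sizes, the pointer-chasing trick, and the output format are illustrative choices.

  /* Minimal sketch: estimate average memory access latency for
   * increasing working-set sizes. The latency jumps when the working
   * set spills out of L1, then L2, then (if large enough) main memory. */
  #define _XOPEN_SOURCE 700
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  static double now_sec(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + 1e-9 * ts.tv_nsec;
  }

  int main(void) {
      /* Working sets from 4 KB to 64 MB, doubling each time. */
      for (size_t bytes = 1u << 12; bytes <= 1u << 26; bytes *= 2) {
          size_t n = bytes / sizeof(size_t);
          size_t *next = malloc(n * sizeof(size_t));
          if (next == NULL) return 1;

          /* Sattolo's algorithm: build one random cycle through all n
           * slots, so the chase below visits the whole working set.   */
          for (size_t i = 0; i < n; i++) next[i] = i;
          for (size_t i = n - 1; i > 0; i--) {
              size_t j = (size_t)random() % i;
              size_t t = next[i]; next[i] = next[j]; next[j] = t;
          }

          /* Dependent loads: each address comes from the previous load,
           * so the latency cannot be hidden by out-of-order execution. */
          size_t steps = 1u << 24, k = 0;
          double t0 = now_sec();
          for (size_t s = 0; s < steps; s++) k = next[k];
          double t1 = now_sec();

          printf("%8zu KB : %6.2f ns per access   (checksum %zu)\n",
                 bytes >> 10, 1e9 * (t1 - t0) / (double)steps, k);
          free(next);
      }
      return 0;
  }

Compile with optimization (e.g. cc -O2; very old glibc versions may also need -lrt). The checksum is printed only so the compiler cannot optimize the chase away.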
Cache Effect on Performance

- A cache miss degrades performance: the CPU sits idle while another cache line is brought in from a lower level of the memory hierarchy.
- A higher cache hit rate means higher performance; efficiency is directly related to the reuse of data in the cache.
- To increase the cache hit rate, access memory sequentially; avoid strides, random access, and indirect addressing in programming:

  for (i = 0; i < 100; i++)     y[i] = 2*x[i];          /* sequential access   */
  for (i = 0; i < 100; i += 4)  y[i] = 2*x[i];          /* strided access      */
  for (i = 0; i < 100; i++)     y[i] = 2*x[index[i]];   /* indirect addressing */

Where in the Cache to Put Data from Memory

- The cache is organized into cache lines, and memory is also logically organized into cache lines.
- Memory size >> cache size, so the number of cache lines in memory is much larger than the number of cache lines in the cache; many memory cache lines correspond to one cache line in the cache.
- Example (32-byte cache line): a 1 MB cache holds 32,768 cache lines, while 2 GB of main memory contains 67,108,864 cache lines.

Cache Classification

- Direct-mapped cache: a given memory cache line is always placed in one specific cache line in the cache.
- Fully associative cache: a given memory cache line can be placed in any of the cache lines in the cache.
- N-way set associative cache: a given memory cache line can be placed in any one of a set of N cache lines in the cache.

Direct-Mapped Cache

- A set of memory cache lines always corresponds to exactly the same cache line in the cache:
    line index in cache = mod(memory cache-line index, total cache lines in cache)
  (Figure: with an 8 KB direct-mapped cache, the memory lines starting at addresses 0, 8K, 16K, ..., up to 2G all map to the same location in the cache.)
- Cheap to implement in hardware, but may cause cache thrashing: repeatedly displacing and loading the same cache lines.

Cache Thrashing: Example

Assumptions: direct-mapped cache; cache size 1 MB; cache line 32 bytes. One double value is 8 bytes, so 131072 doubles occupy 1 MB and one cache line holds 4 doubles; X[131072] and Y[131072] each occupy 1 MB of memory.

  double X[131072], Y[131072];
  long i, j;
  /* initialization of X, Y */
  ...
  for (i = 0; i < 131072; i++)
      Y[i] = X[i] + Y[i];
  ...

Because each array is exactly 1 MB (the size of the cache), X[i] and Y[i] map to the same cache line, and both the cache and each array hold 32,768 lines:
- i=0: load line X[0]-X[3] into cache; load X[0] from cache to a register; load line Y[0]-Y[3] into cache, displacing X[0]-X[3]; load Y[0] from cache into a register; add, update Y[0] in cache.
- i=1: load X[0]-X[3] into cache, displacing Y[0]-Y[3] and writing that line back to memory; load X[1] from cache to a register; load Y[0]-Y[3] into cache, displacing X[0]-X[3]; load Y[1] from cache to a register; add, update Y[1] in cache.
- i=2: load X[0]-X[3] into cache, displacing Y[0]-Y[3] and writing it back to memory; load X[2] from cache to a register; load Y[0]-Y[3] into cache, displacing X[0]-X[3]; load Y[2] from cache to a register; add, update Y[2] in cache.
- i=3: ... and so on.

There is no cache reuse, and performance is poor. Avoid cache thrashing!

Fully Associative Cache

- A cache line from memory can be placed anywhere in the cache, so there is no cache thrashing, but the hardware is costly.
- Direct-mapped and fully associative caches are the two extremes of a spectrum.
- Disadvantage: the entire cache must be searched to determine whether a specific cache line is present.
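As a small illustration of the direct-mapped placement rule, and of why X[i] and Y[i] collide in the thrashing example above, the sketch below computes the cache line that a few addresses map to. It is a minimal model, not part of the lecture: it assumes a 1 MB direct-mapped cache with 32-byte lines and assumes X and Y happen to be laid out back to back in memory (real layouts depend on the compiler and OS); the helper cache_line_of is defined here only for the example.

  /* Minimal model of direct-mapped placement: a memory line with index m
   * maps to cache line (m mod number-of-lines-in-cache).                */
  #include <stdio.h>

  #define LINE_BYTES   32u
  #define CACHE_BYTES  (1u << 20)                    /* 1 MB          */
  #define CACHE_LINES  (CACHE_BYTES / LINE_BYTES)    /* 32,768 lines  */

  /* Cache line (in the cache) that a given byte address maps to. */
  static unsigned cache_line_of(unsigned long addr) {
      unsigned long mem_line = addr / LINE_BYTES;    /* line index in memory */
      return (unsigned)(mem_line % CACHE_LINES);     /* direct-mapped rule   */
  }

  int main(void) {
      /* Hypothetical layout: X starts at address 0, Y right after X. */
      unsigned long X_base = 0;
      unsigned long Y_base = 131072ul * sizeof(double);   /* = 1 MB */

      for (long i = 0; i <= 16; i += 4) {
          unsigned long x_addr = X_base + i * sizeof(double);
          unsigned long y_addr = Y_base + i * sizeof(double);
          printf("X[%2ld] -> cache line %u,  Y[%2ld] -> cache line %u\n",
                 i, cache_line_of(x_addr), i, cache_line_of(y_addr));
      }
      /* Because the arrays are exactly one cache size (1 MB) apart, X[i]
       * and Y[i] always map to the same line: each access evicts the other. */
      return 0;
  }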
N-Way Set Associative Cache

- A compromise between a direct-mapped cache and a fully associative cache.
- The cache lines in the cache are divided into a number of sets, each containing N cache lines.
- Given a cache line from memory, the index of the set it belongs to is computed first; the line is then placed in one of the N cache lines of that set.
- Example (figure): a 2-way set associative 1 MB cache has 32,768 cache lines organized as 16,384 sets of 2 lines each, serving 2 GB of main memory (67,108,864 cache lines).
- A direct-mapped cache is a 1-way set associative cache; a fully associative cache is an N_c-way set associative cache, where N_c is the total number of cache lines in the cache.
- Less likely to cause cache thrashing than a direct-mapped cache, and less costly than a fully associative one.

Instruction/Data Cache

- A CPU may have separate instruction and data caches (split cache).
- A CPU may have a single cache for both instructions and data from memory (unified cache).

Remember ...

- Efficiency is directly related to cache reuse.
- Cache thrashing can be eliminated by padding arrays so that array dimensions are not large powers of two, which would make corresponding elements of different arrays map to the same cache lines; in short, avoid powers of 2.
- To improve cache reuse: access memory sequentially as much as possible; avoid strides, random access, and indirect addressing; avoid cache thrashing.

Example

  double X[1024][1024], Y[1024][1024];
  int i, j;
  ...
  for (j = 0; j < 1024; j++)
      for (i = 0; i < 1024; i++)
          X[i][j] = Y[i][j];

C arrays are stored in row-major order (X[0][0], X[0][1], ..., X[0][1023], X[1][0], ..., X[1023][1023], and likewise for Y), so with this loop ordering the inner loop accesses memory with a stride of 1024 doubles, i.e. 8 KB. A large stride in the memory access pattern results not only in cache misses and poor cache reuse, but also in TLB misses. (A corrected version of this loop appears in the sketch at the end of this section.)

Virtual Memory, Memory Paging

- Modern computers use virtual memory: the memory address seen by a program (the virtual address) is not the actual address in physical memory.
- Memory is divided into pages (e.g. 4 KB); a page in a program's address space corresponds to a page in physical memory.
- (Figure: the virtual address spaces of two programs, one of 4 GB and one of 2 GB, are divided into pages that are mapped onto pages of physical memory.)
- To access memory, the program's virtual address must be translated to the actual address in physical memory; this is done using a page table.

Translation Look-aside Buffer (TLB)

- The TLB is a special cache for the page table, giving faster access to virtual-to-physical address translations.
- When a program accesses a memory location, the translation between the virtual and physical page is loaded into the TLB (if it is not already there).
- If the program exhibits locality of reference, TLB entries can be reused (TLB hits), giving better performance; otherwise TLB misses degrade performance.
- A large stride in the memory access pattern causes TLB misses (as well as cache misses).

Remedies

- Use a large memory page size; on some systems (e.g. IBM SP, HP machines) the page size can be modified by user programs.
- Avoid large strides in memory access; access memory sequentially as much as possible.

Interleaved Memory

- Memory interleaving alleviates the impact of memory cycle time.
- The total memory is divided into a set of memory banks, with contiguous memory addresses residing on different banks.
- When memory is accessed sequentially, the effect of the cycle time is minimized: while the current bank is busy, the next bank is idle and can be accessed immediately.
- Strided memory access is unfavorable: it may hit the same bank repeatedly, forcing waits for the cycle time and degrading performance.
- Example (figure): 2 GB of memory divided into 4 banks of 512 MB each, with a 32-byte cache line; lines are assigned to banks in round-robin order, so bytes 0-31 go to bank 1, bytes 32-63 to bank 2, bytes 64-95 to bank 3, bytes 96-127 to bank 4, bytes 128-159 back to bank 1, and so on.
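As a small illustration of the interleaving rule in the figure above, the sketch below computes which bank each 32-byte line falls in, first for sequential access and then for a stride that revisits the same bank. The 4-bank, 32-byte-line configuration matches the figure; the helper bank_of and the chosen stride are illustrative only.

  /* Minimal sketch of cache-line interleaving across memory banks,
   * matching the 4-bank / 32-byte-line configuration above.        */
  #include <stdio.h>

  #define LINE_BYTES 32u
  #define NUM_BANKS  4u

  /* Bank holding a given byte address: consecutive 32-byte lines are
   * assigned to banks 1, 2, 3, 4, 1, 2, ... in round-robin order.   */
  static unsigned bank_of(unsigned long addr) {
      return (unsigned)((addr / LINE_BYTES) % NUM_BANKS) + 1;
  }

  int main(void) {
      printf("Sequential access, one line at a time:\n");
      for (unsigned long addr = 0; addr < 8 * LINE_BYTES; addr += LINE_BYTES)
          printf("  bytes %4lu-%4lu -> bank %u\n",
                 addr, addr + LINE_BYTES - 1, bank_of(addr));

      /* A stride of 4 lines (128 bytes) hits the same bank every time,
       * so each access must wait out that bank's cycle time.          */
      printf("Strided access, stride = 128 bytes:\n");
      for (unsigned long addr = 0; addr < 8 * 128; addr += 128)
          printf("  byte %5lu -> bank %u\n", addr, bank_of(addr));
      return 0;
  }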
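Finally, returning to the 1024x1024 copy example and the padding remedy discussed earlier: the sketch below shows the loop order interchanged so that the innermost index walks contiguously through memory, together with an array declared with a padded leading dimension so that rows no longer start at power-of-two offsets. The pad of one extra column is an illustrative choice, not a prescription from the lecture.

  /* Minimal sketch of two remedies: loop interchange (unit-stride inner
   * loop) and array padding (avoid power-of-two leading dimensions).   */
  #define N   1024
  #define PAD 1            /* extra column so a row is no longer exactly 8 KB */

  static double X[N][N + PAD], Y[N][N + PAD];

  static void copy_rowwise(void) {
      /* i outermost, j innermost: consecutive inner-loop iterations touch
       * consecutive memory locations (C arrays are row-major), so each
       * cache line brought in is fully used before it is evicted.       */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              X[i][j] = Y[i][j];
  }

  int main(void) {
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              Y[i][j] = i + 0.001 * j;
      copy_rowwise();
      return (X[N-1][N-1] == Y[N-1][N-1]) ? 0 : 1;
  }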