Memory Operation and Performance

The goal of this lecture is to understand the memory architecture so that you can write programs that take advantage of it and run faster. It covers:

- Memory systems
- Caches
- Virtual Memory (VM)

Caches

A cache is fast memory placed between the CPU and main memory. The topics here are:

- Cache design parameters
- A diversity of caches (the different levels of cache)
- Looking at the caches
- Cache-aware programming (column-major versus row-major access; row-major access makes the program faster)

A Diversity of Caches

- Multiple levels of caches (L1, L2 and L3; L1 is faster than L2, and L2 is faster than L3, but L1 is also more expensive than L2)
- On-chip caches (a single level of cache, L1, on the CPU chip)
- Instruction and data caches (instructions and data are kept in separate caches, because an instruction is very likely to be reused later)

Instruction and Data Cache

Figure (from http://www.kids-online.net/learn/clickjr/details/cpu.html): two photos of a CPU (Central Processing Unit). The bottom photo shows the CPU chip from the outside; the top photo is a large road map of the inside of the CPU, showing the data cache and the instruction cache.

Multiple Levels of Caches

Modern computer systems don't have just one cache between the CPU and memory; there is usually a hierarchy of caches. The caches are usually called L1, L2, etc., which is shorthand for Level 1, Level 2, and so on.

- The L1 cache is the cache within the CPU itself, and is therefore the fastest and smallest, but also the most expensive.
- The last cache (usually L2 or L3) is the cache that loads data directly from the DRAM main memory, and is the least expensive.

L1 and L2 Cache

Level 1 cache memory is memory included in the CPU itself. Level 2 cache memory is memory outside the CPU. (Figure: Level 2 cache memory on the processor.)

Example: the benefit of cache memory at different levels

Merge sort divides the data into two halves, sorts each half, and merges the results. (There is no need to memorise this code.)

/* Assumes n is a power of two. */
void merge_sort(int *data, int n)
{
    int half = n >> 1;
    if (n == 1)
        return;
    merge_sort(data, half);            /* sort the first half */
    merge_sort(data + half, half);     /* sort the second half */
    merge(data, data + half, half);    /* merge the two halves */
}

Graphs of Merge Sort

Figure: the access times in nanoseconds (ns) for the L1 cache (T1), the L2 cache (T2), the L3 cache (T3), and main memory (Tm). Figure: merge sort running entirely in a fast, small cache (L1 only) versus in a slow cache (L3 only); comparing the total times, the L3-only run is noticeably longer.

On-Chip Caches

Multilevel caches can improve the performance of a computer. However, there is usually no major difference between having a single L3-sized cache and having three levels of cache; that difference is not as significant as the difference between a single large cache and a single small one.

Instruction and Data Caches

Programs fetch instructions in much more predictable ways than they access data. For instance, instruction fetches exhibit much more spatial locality than data accesses, because it is very likely that an instruction fetch will soon be followed by a fetch of the instruction next to it. For example, if the program is executing a++, there is a high chance that it will next execute b++ and then c = a + b * 3:

a++;
b++;
c = a + b * 3;

Even when a branch or jump instruction makes this untrue, it is very likely that the instruction fetched next will be one that has already been fetched recently.

Figure: multiple levels of caches; note that L1 is within the CPU chip.
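To make the benefit of the hierarchy concrete, the access times above (T1, T2, T3, Tm) can be combined into an average memory access time. The sketch below is illustrative only: the latencies and hit rates are assumed example values, not measurements of any particular machine.

#include <stdio.h>

/* Illustrative sketch: average memory access time (AMAT) for a
 * three-level cache hierarchy.  All latencies and hit rates are
 * assumed values chosen for the example. */
int main(void)
{
    double t1 = 1.0, t2 = 4.0, t3 = 12.0, tm = 100.0; /* access times in ns */
    double h1 = 0.90, h2 = 0.95, h3 = 0.98;           /* hit rates per level */

    /* A miss at one level is served by the next level down. */
    double amat = t1
                + (1.0 - h1) * (t2
                + (1.0 - h2) * (t3
                + (1.0 - h3) * tm));

    printf("Average memory access time: %.2f ns\n", amat);
    return 0;
}

Because the L1 term dominates whenever its hit rate is high, a program whose working set fits in L1 (such as merge sort on a small array) runs at close to T1 per access, which is why the L1-only run above is so much faster.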
Looking at the Cache Design

We can deduce many things about the cache design of a particular computer by carefully examining its memory performance. We can design a benchmark program whose locality we control, such as:

int data[MAXSIZE];

for (i = 0; i < repeat; i++) {
    for (j = 0; j < N; j++) {
        dummy = data[j];
    }
}

Explanation of the program

This loop accesses a chunk of memory repeatedly. By varying N, we vary the temporal locality of the accesses. For example, for N == 4, each of the values data[j] is accessed every 4 iterations, but if N is 16, each data[j] is accessed only every 16 iterations. A cache of size 16 would cause the benchmark to perform much more poorly for N == 32 than for N == 8, because for N == 32 each data[j] would have been evicted (removed) from the cache before it was accessed again.

Controlling the spatial locality

Here, stride controls the amount of spatial locality:

int data[MAXSIZE];

for (i = 0; i < repeat; i++) {
    for (j = 0; j < N; j += stride) {
        dummy = data[j];
    }
}

Result of the benchmark

Figure: the result of the benchmark, transfer rate in MB/s against working-set size. The performance is not simply proportional to the L1 cache size. Why? The transfer rate is highest between 512 bytes and 4 KB.

Interpretation of the result

We immediately notice that memory performance comes in three discrete steps. In the best-performing step, the program accesses so little data that all of its references fit in the L1 cache, and the rest of the hierarchy is almost never needed. In the next step down, the references no longer fit in L1 but do fit in the L2 cache, and access to main memory is almost never required. Try to match the steps to the sizes of L1, L2, and so on. (Figure: transfer rate against working-set size, with the L1 size marked. Figure: the effect of stride on transfer rate.)

Cache-Aware Programming

Cache-aware programming means structuring code so that it uses the caches well, in order to optimise performance. The pitfalls to watch for are:

- Instruction cache overflow
- Cache collisions
- Under-used cache lines
- Insufficient temporal locality

Figures: a running example, optimised step by step; Example (1) takes 4 ms (assume 1M), Example (2) 3 ms (assume 1M), Example (3) 3 ms (assume 512K), Example (4) 2.5 ms (assume 512K), and Example (5) 2.3 ms (assume 256K).

Instruction Cache: a program with a complicated loop

Below is a loop involving three complicated operations:

for (i = 0; i < MAX; i++) {
    <Complicated operation on A[i]>
    <Complicated operation on B[i]>
    <Complicated operation on C[i]>
}

If the code for all three operations together overflows the instruction cache, each iteration evicts instructions that the next iteration needs. It is better to separate them into three loops, so that each complicated operation can make full use of the instruction cache:

for (i = 0; i < MAX; i++) {
    <Complicated operation on A[i]>
}
for (i = 0; i < MAX; i++) {
    <Complicated operation on B[i]>
}
for (i = 0; i < MAX; i++) {
    <Complicated operation on C[i]>
}

Cache Collisions

Cache collisions can also cause our programs to execute slowly. A cache collision occurs when a cache line is evicted (switched out) even though the cache is not full. This happens because each memory address can only be stored in a limited set of cache lines; when that set is full, the system has to decide which of its lines to remove, even if the rest of the cache is empty.

A program with cache collisions

Below is a program involving the arrays a, b and c:

int a[N];
<other stuff...>
int b[N];
<other stuff...>
int c[N];

for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}

Reason for the cache collision

It is possible that the compiler allocates a, b, and c to memory addresses that map to the same cache set. In this case, the assignment c[i] = a[i] + b[i] will cause three cache misses in every iteration of the loop, because the cache constantly evicts the cache line that the CPU requires next: a[i], b[i] and c[i] all compete for the same cache set.
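To see why three arrays can collide like this, it helps to model how an address picks a cache set. The sketch below is a toy model: the line size, set count, and addresses are assumed values chosen for illustration, not those of a real machine.

#include <stdio.h>
#include <stdint.h>

/* Toy model of how an address selects a cache set.  The geometry
 * (32-byte lines, 128 sets) is an assumption for illustration. */
#define LINE_SIZE 32
#define NUM_SETS  128

static unsigned cache_set(uintptr_t addr)
{
    return (addr / LINE_SIZE) % NUM_SETS;  /* line number mod set count */
}

int main(void)
{
    /* If a, b and c start at addresses that are a multiple of
     * LINE_SIZE * NUM_SETS (here 4096 bytes) apart, then a[i], b[i]
     * and c[i] all land in the same set and evict one another. */
    uintptr_t a = 0x10000, b = 0x11000, c = 0x12000; /* assumed layout */

    printf("set of a[0]: %u\n", cache_set(a));
    printf("set of b[0]: %u\n", cache_set(b));
    printf("set of c[0]: %u\n", cache_set(c));
    return 0;
}

All three addresses map to set 0 in this model, so the loop keeps evicting the very lines it is about to use.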
Figure: the cache during the collision.

The solution is to offset the memory locations:

#define CACHELINESIZE <Cache line size of system>
#define COFFSET ((2 * CACHELINESIZE) / sizeof(int))

int a[N];
<other stuff...>
int b[N];
<other stuff...>
int c[N + COFFSET];

for (i = 0; i < N; i++) {
    c[i + COFFSET] = a[i] + b[i];
}

Figure: the cache after the change.

Under-used Cache Lines

Suppose the cache line is 32 bytes wide, as it often is. If a program is reading contiguous (consecutive) 4-byte integers, the reference to the first will cause the first eight integers (integers 0-7) to be loaded into the cache. The reference to the 9th will cause integers 8-15 to be loaded, and so on. The hit ratio, even on a cold cache, will be at least 7/8, or 0.875. Now consider a program that reads integers with a stride of eight or more: it reads the first integer, then the 9th (or higher), then the 17th, and so on. Every access now touches a new cache line. (Figure: the effect; every access is a cache miss.)

Example with a matrix:

int data[M][N];

for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        sum += data[j][i];   /* strides through memory, N integers at a time */
    }
}

Row-major and column-major access

C stores two-dimensional arrays in row-major order, but the loop above accesses the data column by column. Accessing the data row by row (sum += data[i][j], with j in the inner loop) is faster, because after [0][0] is read its neighbours [0][1], [0][2], and so on are already in the cache: one 32-byte line holds eight integers, so in a four-column array a single line holds everything from [0][0] up to [1][3].

Changing the order of the iterations is not always better. In the transpose below, the writes to transposed[i][j] run sequentially through memory, but the reads from original[j][i] then stride through it; swapping the loop order would merely move the problem from one array to the other.

int original[M][N];
int transposed[N][M];

for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        transposed[i][j] = original[j][i];
    }
}

Figure: the effect of the previous program on an image; it rotates the picture.

Insufficient Temporal Locality

The blocked transpose below gets around these cache misses by working on one m-by-n block at a time. The block dimensions m and n are determined by the cache geometry (say, a 32-byte line size) and are chosen so that the blocks currently being worked on fit into the cache together.

int original[M][N];
int transposed[N][M];

/* Assumes m divides N and n divides M. */
for (k = 0; k < N / m; k++) {
    for (l = 0; l < M / n; l++) {
        for (i = k * m; i < (k + 1) * m; i++) {
            for (j = l * n; j < (l + 1) * n; j++) {
                transposed[i][j] = original[j][i];
            }
        }
    }
}

Virtual Memory (VM)

The term virtual memory refers to a combination of hardware and operating-system software that solves several computing problems. It receives a single name because it is a single mechanism, but it meets several goals:

- To simplify memory management and program loading, by providing virtual addresses.
- To allow multiple large programs to be run without the need for large amounts of RAM, by providing virtual storage.

Virtual Addresses

- Segmentation: groups pages together into segments of different sizes.
- Memory protection: because more than one process is held in memory at once, each process's memory must be protected from corruption by the others.
- Paging: uses the same fixed-size unit (the page) on disk and in memory, loading pages into memory and writing them back to disk as needed.

Computers hold several programs in memory at the same time.

Virtual Memory: Explanation

Figure: the sequence of virtual-memory operation when the program is larger than main memory, showing main memory, the disk, and pages moving between them.

Two contradictory facts about VM:

- The compiler determines the address at which a program will execute, by hard-wiring many addresses of variables and instructions into the machine code it generates.
- The location of the program is not determined until the program is executed, and it may be anywhere in main memory.

Solution to the contradictory facts

Code Relocation: Have the compiler generate addresses relative to a base address, and change the base address when the program is executed. This means that the address of each reference is calculated explicitly by adding the relative address to the base address.
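As an illustration of this idea, here is a minimal sketch of base-plus-offset relocation. The memory layout, base addresses, and offsets are assumptions made up for the example.

#include <stdio.h>

/* Sketch of code relocation: every "address" the compiler emits is an
 * offset, and the loader supplies a base when the program starts.
 * All values below are illustrative assumptions. */

static int memory[1024];          /* pretend main memory */

/* The compiler emitted an offset; where the variable actually lives
 * depends on the base chosen at load time. */
static int load(int base, int offset)
{
    return memory[base + offset]; /* one extra addition per reference */
}

int main(void)
{
    int base_a = 0, base_b = 512; /* two programs loaded at different bases */

    memory[0 + 42]   = 111;       /* program A's variable at offset 42 */
    memory[512 + 42] = 222;       /* program B's variable at offset 42 */

    /* Both programs use the same offset, yet read different cells. */
    printf("A sees %d, B sees %d\n", load(base_a, 42), load(base_b, 42));
    return 0;
}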
Drawback: every memory reference now costs an extra addition at run time.

Address Translation: At run time, give each program the illusion that there are no other programs in memory. Compilers can then generate any absolute addresses they wish, and two programs may contain references to the same address without interference.

Virtual and Physical Addresses

The addresses issued by the compiler are called virtual addresses. The addresses that result from the translation are called physical addresses, because they refer to an actual memory chip.

Figure: multiple programs without relocation; Program A overlaps memory locations belonging to Program B. Figure: relocatable code can share memory; Program A uses only the memory locations belonging to itself.

Summary

- Caches: L1 (within the CPU), L2 and L3.
- Data cache and instruction cache.
- Programming: column-major versus row-major access; row-major access enhances performance.
- Virtual memory: when main memory is too small to hold the whole program, pages are loaded into memory as needed.
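As a closing sketch, here is a toy model of the address translation summarised above. The page size and page-table contents are assumed values for illustration; on a real machine the translation is done in hardware (the MMU), using page tables maintained by the operating system.

#include <stdio.h>
#include <stdint.h>

/* Toy model of paged address translation.  Page size and page-table
 * contents are assumptions chosen for the example. */
#define PAGE_SIZE 4096u

/* page_table[v] holds the physical frame number for virtual page v.
 * This toy assumes the virtual page number is always below 4. */
static const uint32_t page_table[4] = { 7, 3, 0, 5 };

static uint32_t translate(uint32_t vaddr)
{
    uint32_t vpage  = vaddr / PAGE_SIZE;   /* virtual page number */
    uint32_t offset = vaddr % PAGE_SIZE;   /* unchanged by translation */
    return page_table[vpage] * PAGE_SIZE + offset;
}

int main(void)
{
    uint32_t vaddr = 1 * PAGE_SIZE + 123;  /* virtual page 1, offset 123 */
    printf("virtual 0x%x -> physical 0x%x\n",
           (unsigned)vaddr, (unsigned)translate(vaddr));
    return 0;
}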