University of Massachusetts, Dartmouth
CIS 570 Survey Paper, Fall 2010
Memory Hierarchies: Basic Design and Optimization Techniques
Submitted by Rahul Sawant, Bharath H. Ramaprasad, Sushrut Govindwar, Neelima Mothe
Computer and Information Science Department

Survey on Memory Hierarchies – Basic Design and Cache Optimization Techniques

Abstract – In this paper we provide a comprehensive survey of past and current work on memory hierarchies and their optimizations, with a focus on cache optimization. We first discuss the levels of a memory hierarchy, the basic optimizations possible at each level, and the motivation for this survey. We then describe the different types of cache memories and their mapping policies, followed by the basic and advanced cache optimizations used to avoid the various categories of cache misses and to achieve high performance and low energy consumption, including techniques such as trace caches [12]. We compare the various cache optimizations and also discuss a few novel memory hierarchies and cache optimizations that deviate from conventional designs, such as Dynamic Memory Hierarchy Performance Optimization [6] and the Way-Predicting Set-Associative Cache for High Performance and Low Energy Consumption [7]. Lastly, we discuss a few open and challenging issues faced by cache optimization techniques.

Index Terms – Column-Associative Caches, Cache Misses, Temporal, Spatial, Multi-Level Cache, Blocking

INTRODUCTION

Most modern CPUs are so fast that, for most program workloads, the bottleneck is the locality of reference of memory accesses and the efficiency of caching and memory transfer between the different levels of the hierarchy. As a result, the CPU spends much of its time idling, waiting for memory I/O to complete.
This is sometimes called the space cost: a larger memory object is more likely to overflow a small, fast level and require use of a larger, slower one. This is where memory hierarchies come into the picture. A memory hierarchy in computer storage distinguishes each level by response time; since response time, complexity, and capacity are related, the levels may also be distinguished by their controlling technology. A conventional memory hierarchy is depicted in Figure 1 below. The design of memory hierarchies and their optimization has been heavily researched over the past couple of decades. The basic problem statement is thus: how do we design a reasonable-cost memory system that can deliver data at speeds close to the CPU's consumption rate? The most obvious answer is to design a hierarchy with slow (inexpensive, large) components at the higher levels and the fastest (most expensive, smallest) components at the lowest level [12]. Cache optimizations are required to overcome the various types of cache misses – compulsory, capacity, and conflict – across the various mapping policies. They improve cache performance and bring memory access latencies down, which in turn speeds up memory fetches and minimizes cache misses and the miss penalty, so that the memory system can deliver data at close to the CPU clock rate.

Figure 1. Conventional Memory Hierarchy: CPU with split L1 I-cache and D-cache, L2 cache, main memory, disk, and tape.

Motivation

The well-known plot of CPU versus memory performance trends shows that the two curves have drifted far apart from one another, indicating a serious need for improvement and advancement in memory technologies and optimizations.
Different application areas in the recent past have required techniques that promise high performance and low power consumption, with wide usage in Content Delivery Networks (CDNs), grid networks, and gaming applications. It is therefore of utmost importance to look into the issues of memory hierarchies and cache mapping policies, and this is the motivation for our survey of the topic.

SURVEY

The survey will discuss basic optimization techniques such as column-associative caches [1] and the predictive sequential associative cache [2] for achieving higher associativity to reduce miss rate; improving direct-mapped cache performance by adding a small fully-associative cache and prefetch buffers [3]; and the characteristics of performance-optimal multi-level cache hierarchies [4] for using multi-level caches to reduce miss penalty. Fixed and adaptive sequential prefetching in shared-memory multiprocessors achieves the effect of larger block sizes to reduce miss rate [11]. Further, the cache performance and optimization of blocked algorithms [5],[3] is discussed, under using larger caches to reduce miss rate.

Types of Basic Cache Optimizations

1. Larger block sizes to reduce miss rate

Miss rate reduction represents the classical approach to improving cache behavior, and has therefore received a lot of interest over the last decade. Table 1 below shows the effect of varying the block size: the larger the block size, the larger the miss penalty. The advantages and disadvantages of this technique are as follows.

Advantages
1. Exploits spatial locality of data and instructions;
2. Reduces compulsory misses.

Disadvantages
1. Increasingly sensitive to miss penalty as block size increases;
2. Conflict misses increase.
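The trade-off above is usually quantified with the average memory access time, AMAT = hit time + miss rate × miss penalty. A minimal Python sketch; the miss rates are illustrative assumptions drawn from the Hennessy–Patterson data [13] from which Table 1 below is derived:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in clock cycles."""
    return hit_time + miss_rate * miss_penalty

# For a 4K cache (1-cycle hit time), growing the block from 16 to 256
# bytes raises the miss penalty (82 -> 112 cycles) faster than spatial
# locality lowers the miss rate, so AMAT gets worse.
print(round(amat(1, 0.0857, 82), 3))   # 16-byte blocks -> 8.027
print(round(amat(1, 0.0951, 112), 3))  # 256-byte blocks -> 11.651
```

The two results reproduce the first and last entries of the 4K column in Table 1.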
Table 1 – Average memory access time (in clock cycles) against block size for four cache sizes

Block   Miss      Cache Size
Size    Penalty   4K       16K     64K     256K
16      82        8.027    4.231   2.673   1.894
32      84        7.082    3.411   2.134   1.588
64      88        7.160    3.323   1.933   1.449
128     96        8.469    3.659   1.979   1.470
256     112       11.651   4.685   2.288   1.549

2. Higher associativity to reduce miss rate

Higher associativity is one of the good techniques for reducing the miss rate. Consider a replacement policy that could have mapped a memory location to any of ten places: all ten must be checked to see whether the location is in the cache. Checking more places takes more power, space, and time. On the other hand, caches with more associativity suffer fewer misses, so the CPU wastes less time reading from slow main memory. The rule of thumb is that doubling the associativity, from direct-mapped to 2-way or from 2-way to 4-way, has about the same effect on hit rate as doubling the cache size; associativity increases beyond 4-way have much less effect on the hit rate. In order of increasing hit times and decreasing miss rates: direct-mapped cache (the best, i.e., fastest, hit times); 2-way set-associative cache; 2-way skewed-associative cache; 4-way set-associative cache; fully associative cache (the best, i.e., lowest, miss rates).

Example: assume the clock cycle time (CCT) is 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to a direct-mapped CCT of 1.0. The resulting average memory access times are:

Cache Size (KB)   1-way   2-way   4-way   8-way
1                 2.33    2.15    2.07    2.01
2                 1.98    1.86    1.76    1.68
4                 1.72    1.67    1.61    1.53
8                 1.46    1.48    1.47    1.43
16                1.29    1.32    1.32    1.32
32                1.20    1.24    1.25    1.27
64                1.14    1.20    1.21    1.23
128               1.10    1.17    1.18    1.20

(Entries marked in red in the original table indicate cases where A.M.A.T. is not improved by more associativity.)

3. Multi-level caches to reduce miss penalty

The increasing speed of new-generation processors will exacerbate the already large difference between CPU cycle times and main memory access times.
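The rule of thumb above can be seen in miniature with a toy LRU set-associative cache simulator (our own illustrative sketch, not taken from the surveyed papers). Two blocks that conflict in a direct-mapped cache can coexist in a 2-way cache of the same total capacity:

```python
def simulate(ways, num_sets, trace):
    """Simulate an LRU set-associative cache; returns the miss count."""
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for block in trace:
        s = sets[block % num_sets]
        if block in s:
            s.remove(block)      # hit: refresh LRU order
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)         # evict least recently used line
        s.append(block)          # most recently used at the tail
    return misses

# Blocks 0 and 8 collide in a direct-mapped cache of 8 sets but share a
# set peacefully in a 2-way cache of 4 sets (same total capacity).
trace = [0, 8, 0, 8, 0, 8]
print(simulate(ways=1, num_sets=8, trace=trace))  # 6 conflict misses
print(simulate(ways=2, num_sets=4, trace=trace))  # 2 compulsory misses
```

Halving the number of sets while doubling the ways keeps total capacity constant, which is how the comparison in the associativity table above is made.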
As this difference grows, it will be increasingly difficult to build single-level caches that are both fast enough to match these fast cycle times and large enough to effectively hide the slow main memory access times. One solution to this problem is a multi-level cache hierarchy. The authors of [4] examine the relationship between cache organization and program execution time for multi-level caches. They show that a first-level cache dramatically reduces the number of references seen by a second-level cache, without having a large effect on the number of second-level cache misses. This reduction in the number of second-level cache hits changes the optimal design point by decreasing the importance of the second-level cache's cycle time relative to its size: the lower the first-level miss rate, the less important the second-level cycle time becomes. This change in the relative importance of cycle time and miss rate makes associativity more attractive and increases the optimal cache size for second-level caches over what they would be in an equivalent single-level cache system [4]. They then infer that the multi-level hierarchy design problem is substantially more complex than the single-level case: not only is there an additional set of design decisions for each level in the hierarchy, but the optimal choices at each level depend on the characteristics of the adjacent caches. The direct influence of upstream caches is evident in the miss-ratio simulations that they plot. Multi-level caches provide one means of dealing with the large difference between the CPU cycle time and the access time of main memory [4]. By providing a second level of caching, one can reduce the cost of first-level misses, which in turn improves overall system performance.
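The benefit of the second level can be expressed with the two-level AMAT relation: AMAT = HitTime_L1 + MissRate_L1 × (HitTime_L2 + LocalMissRate_L2 × MissPenalty_memory). A small sketch with illustrative numbers (our own assumptions, not figures from [4]):

```python
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, mem_penalty):
    """AMAT for a two-level hierarchy: every L1 miss pays the L2 hit time,
    and every L2 (local) miss additionally pays the main-memory penalty."""
    l2_penalty = l2_hit + l2_local_miss_rate * mem_penalty
    return l1_hit + l1_miss_rate * l2_penalty

# 1-cycle L1 with a 5% miss rate, 10-cycle L2 with a 50% local miss rate,
# 100-cycle memory: AMAT = 1 + 0.05 * (10 + 0.5 * 100) = 4.0 cycles,
# versus 1 + 0.05 * 100 = 6.0 cycles with no second level at all.
print(amat_two_level(1, 0.05, 10, 0.5, 100))
```

The L2 thus converts a 100-cycle L1 miss penalty into a 60-cycle average penalty, exactly the "reduce the cost of first-level misses" effect described above.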
Furthermore, by reducing the L1 cache miss penalty, the optimal L1 size is also reduced, increasing the viability of high-performance RISC CPUs coupled with small, short-cycle-time L1 caches.

Advanced Cache Optimization Techniques

1. Way Prediction

Many modern processors use set-associative caches as L1 or L2 caches. Increasing cache associativity increases the hit rate, but it also lengthens the cache access time because of the delay for way selection. To solve this problem, researchers proposed set-associative caches with way prediction, one approach to reducing hit time. Using way prediction, high performance is achieved while energy consumption remains relatively low for set-associative caches: instead of accessing all the ways in a set, only the one predicted cache way is accessed, which reduces energy consumption. Way prediction also improves the energy-delay (ED) product by 60-70% compared to a conventional set-associative cache [18]. Based on speculation, a way-predicting cache selects one way before it starts a normal cache access; when the prediction is accurate, energy consumption is reduced without any performance degradation. Figure 1(a) shows how the way-predicting cache accesses the predicted way. If the prediction is correct, the cache access completes successfully; otherwise, the cache searches the remaining ways, as shown in Figure 1(b). [18]

1(a) Prediction Hit    1(b) Prediction Miss
Figure 1: Way-Predicting Set-Associative Cache

When a way is predicted, as shown in Figure 1(a), the way-predicting cache consumes energy for activating the predicted way only, and the cache access can be completed in one cycle. On a misprediction, however, the access time increases because of the successive two-phase process shown in Figure 1(b), and more energy is consumed. The performance and energy efficiency of a way-predicting cache therefore depend largely on how accurately the way is predicted [18].
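The two-phase access described above can be sketched as a toy simulator (class names and the fill policy are our own illustrative assumptions; the MRU-based prediction follows the idea in [7]/[18]): a predicted hit probes one way and costs one cycle, while a misprediction adds a second phase that probes the remaining ways.

```python
class WayPredictingCache:
    """Toy way-predicting set-associative cache: probe only the predicted
    (MRU) way first; on a mispredict, probe the remaining ways."""

    def __init__(self, num_sets, ways):
        self.tags = [[None] * ways for _ in range(num_sets)]
        self.pred = [0] * num_sets        # predicted (MRU) way per set
        self.ways = ways
        self.cycles = 0                   # access-time proxy
        self.ways_probed = 0              # energy proxy

    def access(self, addr):
        s, tag = addr % len(self.tags), addr // len(self.tags)
        row, p = self.tags[s], self.pred[s]
        self.cycles += 1                  # phase 1: predicted way only
        self.ways_probed += 1
        if row[p] == tag:
            return "predicted hit"
        self.cycles += 1                  # phase 2: remaining ways
        self.ways_probed += self.ways - 1
        if tag in row:
            self.pred[s] = row.index(tag) # retrain prediction to MRU way
            return "mispredicted hit"
        victim = row.index(None) if None in row else p
        row[victim] = tag                 # fill on miss
        self.pred[s] = victim
        return "miss"

c = WayPredictingCache(num_sets=4, ways=4)
print(c.access(0))    # miss
print(c.access(0))    # predicted hit: 1 cycle, 1 way activated
print(c.access(16))   # miss (same set, different way)
print(c.access(0))    # mispredicted hit: 2 cycles, all 4 ways activated
```

The counters make the trade-off explicit: predicted hits cost one cycle and activate one way, while mispredictions cost two cycles and activate every way.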
2. Multi-Bank Cache

A multi-bank cache, shown in Figure 2, uses a crossbar interconnection to distribute the memory reference stream among multiple cache banks. Each bank can independently service one cache request per cycle, and each bank has a single port. This technique delivers high-bandwidth access as long as simultaneous accesses map to independent banks. The design has lower latency and area requirements than a replicated cache, especially for large cache sizes: its small, single-ported banks are less costly and faster, although its crossbar interconnect has a high cost.

Figure 2. Multi-Bank Cache

To map reference addresses onto the corresponding cache banks, a bank selection function is necessary. This function can affect the bandwidth delivered by the multi-bank implementation, since it influences the distribution of accesses across the banks: an inefficient function may increase bank conflicts, reducing the delivered bandwidth. Efficient functions must be weighed against implementation complexity and the possibility of lengthening the cache access time, which renders accurate but complex selection functions unattractive for cache design. [19]

3. Pipelined Cache

A study was done during the design of a microprocessor based on the MIPS instruction set architecture that was to be implemented in GaAs direct-coupled FET logic with multi-chip module packaging. Figure 3 shows the processor over a range of design parameters. A two-level cache organization is necessary to hide the large disparity in speed between the processor and main memory. The level-one (L1) cache is split into instruction (L1-I) and data (L1-D) halves to provide an instruction or data access every cycle. [20]

Figure 3: The access paths of the pipelined instruction cache

Increasing the number of cache pipeline stages can reduce the CPU cycle time, but it will increase the number of load and branch delay cycles due to pipeline hazards.
If no useful instructions are executed during these delay cycles, they increase the number of cycles a program takes to execute, potentially negating the performance benefit of pipelining. Research shows that as many as three load or branch delay cycles can be hidden by static compile-time instruction scheduling or by dynamic hardware-based methods. Static schemes are more effective at hiding branch delay cycles, while dynamic methods are more effective at filling load delay slots. By combining these schemes, caches with two to three pipeline stages achieve higher performance than caches with fewer pipeline stages. [20]

4. Non-Blocking Cache

Non-blocking caches hide memory latency by exploiting the overlap of processor computation with data access. A non-blocking cache allows execution to proceed concurrently with cache misses as long as dependency constraints are observed. The research considered the following: 1) load operations are non-blocking, 2) write operations are non-blocking, and 3) the cache is capable of servicing multiple cache miss requests. To allow these non-blocking operations and multiple outstanding misses, the author introduced Miss Information/Status Holding Registers (MSHRs), which record the information pertaining to the outstanding requests. Each MSHR entry includes the data block address, the cache line for the block, the word in the block that caused the miss, and the functional unit or register to which the data is to be routed. [21] Non-blocking loads require extra support in the execution unit of the processor, in addition to the MSHRs associated with a non-blocking cache. If static instruction scheduling is used in the processor pipeline, some form of register interlock is needed to preserve correct data dependencies. A consistency problem can arise when the processor allows non-blocking writes, since a later read may be needed before a previous buffered write is performed.
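The MSHR bookkeeping described above can be sketched as follows (a minimal illustration with hypothetical names, not the hardware design from [21]); a secondary miss to an already-outstanding block merges into the existing entry instead of issuing a second memory request:

```python
class MSHRFile:
    """Toy Miss Status Holding Register file: one entry per outstanding
    miss, recording the block address and the destinations waiting on it."""

    def __init__(self, num_entries=4):
        self.entries = {}             # block_addr -> [(word, dest_reg)]
        self.num_entries = num_entries

    def handle_miss(self, block_addr, word, dest_reg):
        if block_addr in self.entries:    # secondary miss: merge, no new request
            self.entries[block_addr].append((word, dest_reg))
            return "merged"
        if len(self.entries) == self.num_entries:
            return "stall"                # all MSHRs busy: structural stall
        self.entries[block_addr] = [(word, dest_reg)]
        return "allocated"                # primary miss: request sent to memory

    def fill(self, block_addr):
        """Block returned from memory: route the words and free the entry."""
        return self.entries.pop(block_addr)

m = MSHRFile(num_entries=2)
print(m.handle_miss(0x40, 0, "r1"))   # allocated
print(m.handle_miss(0x40, 1, "r2"))   # merged (same block outstanding)
print(m.handle_miss(0x80, 0, "r3"))   # allocated
print(m.handle_miss(0xC0, 0, "r4"))   # stall (MSHRs full)
```

An associative lookup on `entries` by block address is exactly the check a real design would also apply against buffered writes to resolve the read-after-buffered-write hazard mentioned above.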
If these two operations are on the same data block, an associative check against the write buffer or the MSHRs must be done to provide the correct value to the following read. [21]

5. Software Prefetching

Software prefetching is one of the better techniques for matching the performance of the memory subsystem to today's high-performance processors. A compiler algorithm is used to insert prefetch instructions into scientific codes that operate on dense matrices. The algorithm issues prefetches only for the references that are most likely to cause cache misses. By generating fully functional code, the improvements in cache miss rates can be measured; the algorithm also improves the execution speed of the programs. [22] A few concepts are important in developing such prefetch algorithms: prefetches are possible only if the memory addresses can be determined ahead of time, and prefetches are unnecessary if the data are already in the cache at the time of the prefetch. [22]

6. Hardware Prefetching

Several schemes have been proposed for prefetching that are strictly hardware-based. Porterfield evaluated several cache-line-based hardware prefetching schemes; these are quite effective at reducing miss rates, but at the same time they often increase memory traffic. Lee proposed a scheme for prefetching in a multiprocessor where all shared data is uncacheable, and found that the effectiveness of these schemes depends on branch prediction and synchronization. Baer and Chen proposed a scheme that uses a history buffer to detect strides: a "lookahead PC" runs through the program ahead of the normal PC using branch prediction, and when the lookahead PC finds a matching stride entry in the table, it issues a prefetch. They evaluated the scheme in a memory system with a 30-cycle miss latency and found good results.
[22] Some of the advantages of hardware prefetching are that it has better dynamic information, allowing it to recognize things such as unexpected cache conflicts that are difficult to predict in the compiler, and that it does not add any instruction overhead to issue prefetches. [22] Its disadvantage is the difficulty of detecting the memory access patterns. [22]

Cache Optimization Algorithms

Belady's Algorithm
The most efficient caching algorithm would always discard the information that will not be needed for the longest time in the future; this optimal result is referred to as Belady's optimal algorithm. It is generally not implemented in practice because of one disadvantage: it is impossible to predict how far in the future information will be needed.

Least Recently Used
LRU discards the least recently used items first. The algorithm requires keeping track of what was used when, which is expensive if one wants to make sure the algorithm always discards the least recently used item. General implementations of this technique keep "age bits" for the cache lines and track the least recently used line based on those bits; in such an implementation, every time a cache line is used, the ages of all the other cache lines change. LRU is actually a family of caching algorithms.

Most Recently Used
This algorithm, in contrast to LRU, discards the most recently used items first. According to [15], "when a file is being repeatedly scanned in a reference pattern, MRU is the best replacement algorithm." Other authors have pointed out that for random access patterns and repeated scans over large datasets, MRU cache algorithms have more hits than LRU because of their tendency to retain older data [16].

Pseudo-LRU
For caches with large associativity (generally more than 4 ways), the implementation cost of LRU becomes prohibitive.
If a scheme that almost always discards one of the least recently used items is sufficient, the PLRU algorithm can be used, which needs only one bit per cache item to work.

Segmented LRU
A modification of LRU, called Segmented LRU (SLRU), divides the cache into two segments:
1. a probationary segment, and
2. a protected segment (finite).
Lines in each segment are ordered from the most to the least recently accessed. Data from misses is added to the cache at the most recently accessed end of the probationary segment. Hits are removed from wherever they currently reside and added to the most recently accessed end of the protected segment. Because the protected segment is finite, the migration of a line from the probationary segment to the protected segment may force the migration of the LRU line of the protected segment to the most recently used (MRU) end of the probationary segment, giving that line another chance to be accessed before being replaced. The size limit on the protected segment is an SLRU parameter that varies according to the I/O workload patterns. Whenever data must be discarded from the cache, lines are obtained from the LRU end of the probationary segment [17].

Research Work in Recent Years

1. Analytical Model for Hierarchical Cache Optimization in IPTV Networks [8]

In an IPTV network, Video on Demand and other video services generate a large amount of unicast traffic from the Video Hub Office (VHO) to subscribers and therefore require additional bandwidth and equipment resources in the network. To reduce this traffic and the overall network cost, a portion of the video content (the most popular titles) may be stored in caches closer to the subscribers, e.g., in a Digital Subscriber Line Access Multiplexer (DSLAM), a Central Office (CO), or an Intermediate Office (IO). The problem then becomes minimizing cost by optimizing cache memory placement and size. The authors present an analytical model of hierarchical cache optimization.
This model depends on several basic parameters: traffic volume, cache hit rate as a function of memory size, topology (i.e., the number of DSLAMs, service routers at CO locations, and service switches at IO locations), and cost parameters. Some reasonable assumptions about the network cost structure and the hit-rate function allow an analytically optimal solution to the problem to be obtained, and the authors analyze the factors that affect this solution [8].

2. Peer-Assisted Video Streaming With Supply-Demand-Based Cache Optimization [9]

The authors consider a hybrid P2P video-on-demand architecture that utilizes both server and peer resources for efficient transmission of popular videos. They propose a system architecture wherein each peer dedicates some cache space to store a particular segment of a video file, as well as some of its upload bandwidth to serve the cached segment to other peers. Peers join the system and issue a streaming request to a control server, which directs them to streaming servers or to other peers who have the desired video segments [9]. The control server also decides which peer should cache which video segment. The main objective is to determine proper caching strategies at the peers so that the average load on the streaming servers is minimized. To this end, the authors pose the caching problem as a supply-demand-based utility optimization problem. By exploiting the inherent structure of a typical on-demand streaming application, as well as the availability at the control server of a global view of the current supply and demand, they demonstrate how system performance can be significantly improved over brute-force caching decisions. They mainly consider three caching mechanisms. In the first mechanism (cache prefetching), a segment is prefetched to a given peer for caching purposes upon the peer's arrival to the system, regardless of whether that segment is currently demanded by that peer.
In the second mechanism (opportunistic cache update), a peer has the option of replacing the segment currently in its cache with the last segment that it finished streaming. The third mechanism combines both into a hybrid caching strategy. In particular, they find that a dynamic-programming (DP)-based utility maximization solution using only the cache update method performs significantly better in reducing the server load. Furthermore, their findings suggest that even less sophisticated cache update solutions can perform almost as well as prefetching strategies in interesting regions of operation.

3. Web Cache Optimization in a Semantic-Based Web Search Engine [10]

With the tremendous growth of information available to end users through the Web, search engines come to play an ever more critical role. Nevertheless, because of their general-purpose approach, it is increasingly common for the returned result sets to be burdened with useless pages. The next-generation Web architecture, represented by the Semantic Web, provides a layered architecture that may allow this limitation to be overcome. An ontology spanning multiple search engines is written so that, for a single query, the final result is obtained from multiple search engines. After obtaining the user's query results, clustering is applied and the results are arranged alphabetically. Several search engines have been proposed that increase information-retrieval accuracy by exploiting a key feature of Semantic Web resources, namely relations. Web cache optimization can be used in the search engine for fast retrieval of user query results; in this work, the authors use web cache optimization based on an eviction method for a semantic web search engine.
The paper analyzes both the advantages and disadvantages of several current Web cache replacement algorithms, including the lowest relative value algorithm, the least weighted usage algorithm, and the least unified-value algorithm. The authors propose a new algorithm, called least grade replacement (LGR), which takes recency, frequency, perfect history, and document size into account for Web cache optimization. They cite the work of Bahn et al., who proposed a web cache replacement algorithm called Least Unified Value (LUV) that uses the complete reference history of documents, in terms of reference frequency and recency.

Disadvantages of existing systems:
• Text-based searching only (e.g., Google, Yahoo, MSN, Wikipedia).
• No semantic relationships to give exact results.
• A query targets only a single search engine.
• Most existing search engines provide poor support for accessing web results.
• No removal of stop words from the user query.
• Results are often not relevant or exact.
• The number of iterations is high.
• A replacement policy is required to evict a page from the web cache to make room for a new page.

To overcome these disadvantages, the authors propose a proxy server with the above features, which enables faster browsing of web pages through the least-grade page replacement algorithm.

Conclusion

We gave a brief introduction to memory hierarchy design and optimization and the motivation for developing such techniques, followed by a detailed survey of basic and advanced cache optimization techniques. Cache replacement algorithms were then discussed, as well as applications of memory hierarchies and cache optimization in different domains.

References

[1] Anant Agarwal and Steven D.
Pudar; "Column-Associative Caches: A Technique for Reducing the Miss Rate of Direct-Mapped Caches," Proc. 20th Annual International Symposium on Computer Architecture (ISCA), 1993.
[2] Calder, B.; Grunwald, D.; Emer, J.; "Predictive sequential associative cache," Proc. Second International Symposium on High-Performance Computer Architecture (HPCA), pp. 244-253, 3-7 Feb. 1996. doi:10.1109/HPCA.1996.501190
[3] Jouppi, N.P.; "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," Proc. 17th Annual International Symposium on Computer Architecture (ISCA), pp. 364-373, 28-31 May 1990. doi:10.1109/ISCA.1990.134547
[4] Steven Przybylski, Mark Horowitz, and John Hennessy; "Characteristics of Performance-Optimal Multi-Level Cache Hierarchies," Proc. 16th Annual International Symposium on Computer Architecture (ISCA), 1989. ISBN 0-89791-319-1
[5] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf; "The cache performance and optimizations of blocked algorithms," Proc. Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), 1991.
[6] Rajeev Balasubramonian, David Albonesi, Alper Buyuktosunoglu, and Sandhya Dwarkadas; "Dynamic Memory Hierarchy Performance Optimization," 2000.
[7] Koji Inoue et al.; "Way-Predicting Set-Associative Cache for High Performance and Low Energy Consumption," Proc. 1999 International Symposium on Low Power Electronics and Design (ISLPED).
[8] Sofman, L.B.; Krogfoss, B.; "Analytical Model for Hierarchical Cache Optimization in IPTV Network," IEEE Transactions on Broadcasting, vol. 55, no. 1, pp. 62-70, March 2009.
[9] Kozat, U.C.; Harmanci, O.; Kanumuri, S.; Demircin, M.U.; Civanlar, M.R.; "Peer Assisted Video Streaming With Supply-Demand-Based Cache Optimization," IEEE Transactions on Multimedia, vol. 11, no. 3, pp. 494-508, April 2009. doi:10.1109/TMM.2009.2012918
[10] S.N. Sivanandam, M. Rajaram, and S. Latha Shanmuga Vadivu; "Web Cache Optimization in Semantic based Web Search Engine," International Journal of Computer Applications (0975-8887), vol. 10, no. 9, November 2010.
[11] Dahlgren, F.; Dubois, M.; Stenstrom, P.; "Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors," Proc. International Conference on Parallel Processing (ICPP), vol. 1, pp. 56-63, 16-20 Aug. 1993.
[12] Weifeng Zhang; Checkoway, S.; Calder, B.; Tullsen, D.M.; "Dynamic Code Value Specialization Using the Trace Cache Fill Unit," Proc. International Conference on Computer Design (ICCD 2006), pp. 10-16, 1-4 Oct. 2007.
[13] D. Patterson and J. Hennessy; Computer Architecture: A Quantitative Approach.
[14] Hennessy and Patterson; Computer Architecture: A Quantitative Approach (3rd ed.), Memory Hierarchy Design.
[15] Hong-Tai Chou and David J. DeWitt; "An Evaluation of Buffer Management Strategies for Relational Database Systems," VLDB, 1985.
[16] Shaul Dar, Michael J. Franklin, Björn Þór Jónsson, Divesh Srivastava, and Michael Tan; "Semantic Data Caching and Replacement," VLDB, 1996.
[17] Ramakrishna Karedla, J. Spencer Love, and Bradley G. Wherry; "Caching Strategies to Improve Disk System Performance," IEEE Computer, 1994.
[18] Koji Inoue, Tohru Ishihara, and Kazuaki Murakami; "Way-Predicting Set-Associative Cache for High Performance and Low Energy Consumption."
[19] Jude A. Rivers, Gary S. Tyson, and Edward S. Davidson; "On High-Bandwidth Data Cache Design for Multi-Issue Processors."
[20] Kunle Olukotun, Trevor Mudge, and Richard Brown; "Performance Optimization of Pipelined Primary Caches."
[21] Tien-Fu Chen and Jean-Loup Baer; "Reducing Memory Latency via Non-Blocking and Prefetching Caches."
[22] Todd C. Mowry, Monica S. Lam, and Anoop Gupta; "Design and Evaluation of a Compiler Algorithm for Prefetching."