Cache Rationing for Multicore

Jacob Brock and Chen Ding
University of Rochester
{jbrock, cding}@cs.rochester.edu

Abstract

As the number of transistors on a chip increases, they are used mainly in two ways on multicore processors: first, to increase the number of cores, and second, to increase the size of cache memory. The two approaches intersect at a basic problem: how parallel tasks can best share the cache memory. The degree of sharing determines the cache resource available to each core and hence the memory performance and scalability of the system. In this paper, cache rationing is presented as a cache sharing solution for collaborative caching.

Categories and Subject Descriptors: D.2.8 [Metrics]: Performance measures

General Terms: measurement, performance

Keywords: cache sharing, cache replacement policy, collaborative caching

1. Basic Methods of Cache Sharing

A basic solution to shared caching is partitioning. First, cache memory can be partitioned in hardware. Current multicore processors have part of the cache memory dedicated to each core and the rest shared. For example, Intel Nehalem has 256KB of L2 cache per core and 4MB to 8MB of L3 cache shared by all cores. IBM Power 7 has 8 cores, with 256KB of L2 cache per core and 32MB of L3 cache shared by all cores. Second, shared cache can be partitioned in software using a method called page coloring, which maps program data to a sub-space of the cache. A recent, programmable system is ULCC [3].

Partitioned cache removes the interference between parallel tasks. However, it limits resource utilization. If only one program is running, the private cache in all but one core goes unused. In ULCC, insufficient knowledge of a parallel program or a user mistake can leave part of the shared cache unutilized. This is a basic limitation: partitioning is by allocation, and a program cannot use more than it is allocated.

In shared cache, the default sharing is by priority. A program may use the whole cache if no other program is running. Since a program does not use the same amount of data all the time, sharing the cache can be more beneficial than partitioning it. As an example, Figure 1 compares two programs and their hit/miss sequences when sharing a 4-block cache and when dividing it between them. The benefit of sharing is significant: in either program, every third miss in the partitioned cache becomes a hit in the shared cache.

[Figure 1. Compared to partitioning the cache, sharing the cache can change every third miss ('m') into a hit ('h'), while using the same amount of total space (4 blocks).]
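The effect illustrated by Figure 1 can be reproduced with a small trace-driven simulation. The Python sketch below compares two private 2-block LRU caches against one shared 4-block LRU cache. The two access traces are hypothetical sequences with complementary phases, chosen only to make the effect visible; they are not the exact sequences of Figure 1, and the LRUCache class and trace names are ours.

    from collections import OrderedDict

    class LRUCache:
        """Minimal fully associative cache with LRU replacement."""
        def __init__(self, size):
            self.size = size
            self.blocks = OrderedDict()

        def access(self, block):
            """Access one block; return True on a hit, False on a miss."""
            hit = block in self.blocks
            if hit:
                self.blocks.move_to_end(block)       # move to the MRU end
            else:
                self.blocks[block] = True
                if len(self.blocks) > self.size:
                    self.blocks.popitem(last=False)  # evict the LRU block
            return hit

    def misses(cache, trace):
        return sum(not cache.access(b) for b in trace)

    # Complementary phases: p1 needs 3 blocks while p2 needs 1, then the
    # roles reverse, so together they never need more than 4 blocks.
    p1 = ["a", "b", "c", "a", "b", "c", "a", "a", "a", "a", "a", "a"] * 4
    p2 = ["x", "x", "x", "x", "x", "x", "x", "y", "z", "x", "y", "z"] * 4

    # Partitioned: each program runs alone in a private 2-block cache.
    partitioned = misses(LRUCache(2), p1) + misses(LRUCache(2), p2)

    # Shared: both programs interleave their accesses in one 4-block cache.
    interleaved = [b for pair in zip(p1, p2) for b in pair]
    shared = misses(LRUCache(4), interleaved)

    print(f"partitioned misses: {partitioned}, shared misses: {shared}")

With these traces, the partitioned caches miss whenever a program's working set exceeds its private two blocks, while the shared cache holds both working sets because their combined size never exceeds four blocks.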
Cache sharing, however, has a major shortcoming: it is risky, because programs interfere with each other. One program may degrade the cache performance of all its peers.

2. Cache Rationing

Neither of the previous sharing methods is ideal. Partitioning is robust but may be wasteful. Priority is efficient but volatile. The goal of cache rationing is to retain the benefit of sharing while controlling the interference. It combines two types of priorities: the original priority for cache sharing, which is controlled by hardware, and a second priority system for cache allocation, which is controlled by software. Next we describe the hardware interface, the program transformation, and an overview of multi-program cache rationing, and show the result of an example test.

2.1 Collaborative Caching

Cache rationing requires hardware support, in particular a single bit at each memory instruction to control whether the cache should store the accessed data. A limited form of such an interface is the non-temporal store on Intel machines, which has recently been used to reduce string-processing and memory-zeroing time [4, 7]. Other systems have been built or proposed to let software hints influence hardware caching. Earlier examples include the placement hints on Intel Itanium [1], the bypassing access on the IBM Power series [5], and the evict-me bit of [6]. Wang et al. called such a combined software-hardware solution collaborative caching [6].

2.2 Program Transformation

Optimal caching is impossible to implement purely in hardware because it requires knowledge of future memory accesses. A solution, called Program Assisted Cache Management (Pacman), is to compute the optimal solution in a profiling run and apply that solution to the actual run of the program [2]. When the program inputs change, a prediction is made for how the optimal solution will change for a given memory access. Figure 2 shows an overview of the system, and its three phases are outlined as follows:

1. Profiling Time. After a training run, the memory access trace is profiled to determine the OPT distance at each access. Patterns are then identified in the sequence of OPT distances (which represent the smallest cache a memory block will go into under OPT) for use at compile time to determine the best cache hints. With multiple training runs using different input array sizes, patterns for other inputs can be extrapolated.

2. Compile Time. Based on the patterns, a compiler performs loop splitting so that at run time, the loop iterations can be divided into groups in which each memory access is tagged for either LRU or MRU eviction (MRU if the patterns suggest the data would not be cached under OPT).

3. Run Time. At run time, based on the cache size and the input, a Pacman-optimized loop determines the actual division of the loop splitting to guide the cache with LRU/MRU hints toward OPT cache management.

[Figure 2. Overview of the Pacman system with OPT pattern profiling, LRU/MRU loop splitting, and collaborative caching.]

2.3 Rationing the Shared Cache

For the purpose of applying Pacman to multithreaded programs, we define a new policy for shared caches: cache rationing. The cache ration is the size of cache that Pacman assumes for each program in order to supply hints. As far as the hints are concerned, the cache is partitioned, but in practice the cache is shared. By promoting moderation in this way, cache rationing adds the safety of cache partitioning to the benefit of cache sharing. For example, for a cache of size 10 and two co-run programs, the ration might be 2:3, so that the Pacman threshold for Program A is 4 and the threshold for Program B is 6. Although the two programs read and write in the same cache, each block of Program A with a predicted forward OPT stack distance over 4 is tagged for MRU eviction, and each block of Program B with a predicted forward OPT stack distance over 6 is tagged for MRU eviction.
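The threshold arithmetic in this example can be written out as a short sketch. In the Python below, the function names and the integer rounding are ours and only illustrate the idea; they are not part of the Pacman implementation.

    def ration_thresholds(cache_size, rations):
        """Split a cache of cache_size blocks according to per-program
        rations, e.g. {"A": 2, "B": 3}, and return each program's
        Pacman hint threshold in blocks."""
        total = sum(rations.values())
        return {prog: cache_size * share // total
                for prog, share in rations.items()}

    def cache_hint(predicted_opt_distance, threshold):
        """Tag an access MRU if OPT would not keep the block within the
        program's ration; otherwise leave the default LRU behavior."""
        return "MRU" if predicted_opt_distance > threshold else "LRU"

    # The example from the text: a 10-block cache rationed 2:3.
    thresholds = ration_thresholds(10, {"A": 2, "B": 3})
    print(thresholds)                       # {'A': 4, 'B': 6}
    print(cache_hint(5, thresholds["A"]))   # MRU: beyond A's threshold of 4
    print(cache_hint(5, thresholds["B"]))   # LRU: within B's threshold of 6

Because MRU-tagged blocks are the first to be evicted, each program gives up cache beyond its ration under pressure, yet its remaining blocks are free to occupy cache that its partner leaves idle.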
Cache rationing was chosen over cache partitioning because the possibility for a cache-"sipping" program to allow greater cache usage by its co-run partner was hypothesized to result in fewer misses, and Figure 3 bears this out. Figure 3 compares Pacman cache rationing to Pacman cache partitioning with a streaming application and an implementation of successive over-relaxation (SOR), which has been used to solve linear systems of equations. The program traces were interleaved with 4 SOR references for each Streaming reference, but the results are nearly identical for 1:1 or 1:4 interleaving. The cache was rationed between SOR and Streaming from 10:90 to 90:10.

In all tests, rationing shows only a very modest improvement over partitioning for Streaming. For SOR, however, miss ratios are nearly halved for some cache sizes, falling almost to the miss ratio of having the whole cache. In this range of cache sizes, Pacman with rationing is approximately as effective as LRU with partitioning, but in cases where Pacman outperforms LRU (see [2]), Pacman will likely outperform LRU with either partitioning or sharing.

This example demonstrates the two potential advantages of Pacman cache rationing for managing program co-runs: safety and sharing. Co-run programs will not encroach on each other's rations, but a program can still utilize the available cache space when it is not being used or rationed by other programs.

[Figure 3. Pacman miss ratios for cache rationed and cache partitioned runs of Streaming and SOR (x-axis: Cache Ration (MB); y-axis: Miss Ratio). Partitioned-cache points are plotted for cache sizes of 0.125MB, 0.25MB, 0.5MB, and 1MB. For rationed-cache points, the horizontal axis represents the cache ration as a percentage of the total 1MB cache. The same traces were used for training and testing.]

References

[1] K. Beyls and E. D'Hollander. Generating cache hints for improved program efficiency. Journal of Systems Architecture, 51(4):223-250, 2005.
[2] J. Brock, X. Gu, B. Bao, and C. Ding. Pacman: Program-assisted cache management. In ISMM, 2013.
[3] X. Ding, K. Wang, and X. Zhang. ULCC: a user-level facility for optimizing shared cache performance on multicores. In PPoPP, pages 103-112, 2011.
[4] S. Rus, R. Ashok, and D. X. Li. Automated locality optimization based on the reuse distance of string operations. In CGO, pages 181-190, 2011.
[5] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner. Power5 system microarchitecture. IBM Journal of Research and Development, 49:505-521, July 2005.
[6] Z. Wang, K. S. McKinley, A. L. Rosenberg, and C. C. Weems. Using the compiler to improve cache replacement decisions. In PACT, 2002.
[7] X. Yang, S. M. Blackburn, D. Frampton, J. B. Sartor, and K. S. McKinley. Why nothing matters: the impact of zeroing. In OOPSLA, pages 307-324, 2011.