Cache Rationing for Multicore
Jacob Brock and Chen Ding
University of Rochester
{jbrock, cding}
Abstract
As the number of transistors on a chip increases, they are used
mainly in two ways on multicore processors: first, to increase the
number of cores, and second, to increase the size of cache memory.
The two approaches intersect at a basic problem, which is how
parallel tasks can best share the cache memory. The degree of
sharing determines the available cache resource for each core and
hence the memory performance and scalability of the system. In
this paper, cache rationing is presented as a cache sharing solution
for collaborative caching.
Figure 1. Compared to partitioning the cache, sharing the cache
can change every third miss (‘m’) into a hit (‘h’), while using the
same amount of total space (4 blocks).
Cache sharing has a major shortcoming: it is risky since programs interfere with each other. One program may interfere and
degrade the cache performance for all its peers.
Keywords cache sharing, cache replacement policy, collaborative caching
Basic Methods of Cache Sharing
A basic solution to shared caching is partitioning. First, cache
memory can be partitioned in hardware. Current multi-core processors have part of the cache memory dedicated to each core and the
rest shared. For example, Intel Nehalem has 256KB L2 cache per
core and 4MB to 8MB L3 cache shared by all cores. IBM Power 7
has 8 cores, with 256KB L2 cache per core and 32MB L3 shared by
all cores. Second, shared cache can be partitioned in software using
a method called page coloring, which maps program data to a sub-space
in cache. A recent, programmable system is ULCC [3].
Partitioned cache removes the interference between parallel
tasks. However, it limits the resource utilization. If only one program is running, the private cache in all but one core is not utilized.
In ULCC, insufficient knowledge of a parallel program or a user
mistake can leave part of the shared cache unutilized. This is a basic limitation: partitioning is by allocation, and a program cannot
use more than what it has been allocated.
In a shared cache, the default sharing is by priority. A program
may use the whole cache if no other program is running. Since a program
does not use the same amount of data all the time, sharing the cache
can be more beneficial than partitioning it. As an example, Figure 1
compares two programs and their hit/miss sequences when sharing
a 4-block cache and when dividing it. The benefit of sharing is
significant. In either program, every third miss in the partitioned
cache becomes a hit in the shared cache.
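The effect can be reproduced with a small simulation. The sketch below is illustrative only: the two traces are hypothetical (they are not the traces of Figure 1), the replacement policy is assumed to be plain LRU, and the helper name `lru_misses` is ours. Each program alternates between a phase that reuses three blocks and a phase that reuses one, so a fixed 2-block partition is too small in one phase and too large in the other.

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count the misses of a fully associative LRU cache on a trace."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # hit: refresh recency
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return misses

# Hypothetical traces: each program has a 3-block phase and a 1-block phase.
trace_a = ["a1", "a2", "a3"] * 2 + ["x"] * 6
trace_b = ["y"] * 6 + ["b1", "b2", "b3"] * 2

# Assumed interleaving: the programs alternate, one reference each.
shared_trace = [blk for pair in zip(trace_a, trace_b) for blk in pair]

# Partitioned: 2 blocks per program. Shared: one 4-block cache.
partitioned_misses = lru_misses(trace_a, 2) + lru_misses(trace_b, 2)
shared_misses = lru_misses(shared_trace, 4)

print("partitioned misses:", partitioned_misses)
print("shared misses:", shared_misses)
```

With these particular traces the shared cache incurs fewer misses, because each program borrows space while its peer is in its 1-block phase. Other traces can reverse the outcome, which is the interference risk of sharing.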
Cache Rationing
Neither of the previous sharing methods is ideal. Partitioning is robust but may be wasteful. Priority is efficient but volatile. The goal
of cache rationing is to retain the benefit of sharing while controlling the interference. It combines two types of priorities: the original priority for cache sharing, which is controlled by hardware;
and a second priority system for cache allocation, which is controlled by software. Next we describe the hardware interface and the program transformation, give an overview of multi-program cache
rationing, and show the result of an example test.
Collaborative Caching
Cache rationing requires hardware support, in particular, a single
bit at each memory instruction to control whether cache should
store the accessed data. A limited form of such an interface is the non-temporal store on Intel machines, which has recently been used
to reduce string processing and memory zeroing time [4, 7]. Other
systems have been built or proposed for software hints to influence
hardware caching. Earlier examples include the placement hints
on Intel Itanium [1], bypassing access on IBM Power series [5], and
the evict-me bit of [6]. Wang et al. called a combined software-hardware solution collaborative caching [6].
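To make the one-bit interface concrete, the following sketch models a cache whose accesses carry a hint bit. It is an assumption-laden illustration, not a description of any real hardware: the class `HintedCache`, the parameter `mru_hint`, and the choice to place a hinted block at the head of the eviction order are all ours. The trace mixes two reused blocks with a stream of blocks that are never reused.

```python
from collections import OrderedDict

class HintedCache:
    """Fully associative cache; each access carries a one-bit hint (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # front of the order = next block to evict
        self.misses = 0

    def access(self, block, mru_hint=False):
        if block not in self.blocks:
            self.misses += 1
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # evict the block at the front
            self.blocks[block] = True
        # Hint clear: the block becomes the last to be evicted (plain LRU).
        # Hint set: the block becomes the first to be evicted.
        self.blocks.move_to_end(block, last=not mru_hint)

def run(trace, capacity, use_hints):
    """Replay (block, is_stream) pairs; hint stream accesses if use_hints is set."""
    cache = HintedCache(capacity)
    for block, is_stream in trace:
        cache.access(block, mru_hint=use_hints and is_stream)
    return cache.misses

reused = [("r1", False), ("r2", False)]
trace = (reused
         + [("s1", True), ("s2", True), ("s3", True)]
         + reused
         + [("s4", True), ("s5", True), ("s6", True)]
         + reused)

plain_misses = run(trace, capacity=3, use_hints=False)
hinted_misses = run(trace, capacity=3, use_hints=True)
print("plain LRU misses:", plain_misses)
print("hinted misses:", hinted_misses)
```

In this toy trace the hinted stream blocks displace only each other, so the reused blocks stay cached. This is the kind of benefit a software hint is meant to provide.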
Program Transformation
Optimal caching is impossible to implement purely in hardware
because it requires knowledge of future memory accesses. A solution, called Program Assisted Cache Management (Pacman), is
to compute the optimal solution in a profiling run, and apply that
solution to the actual run of the program [2]. When the program inputs change, a prediction is made for how the optimal solution will
change for a given memory access. Figure 2 shows an overview of
the system, and the three phases are outlined as follows:
1. Profiling Time After a training run, the memory access trace is
profiled to determine the OPT distance at each access. Patterns
are then identified in the sequence of OPT distances (which
represent the smallest cache a memory block will go into under
OPT) for use at compile time to determine the best cache hints.
With multiple training runs using different input array sizes,
patterns for other inputs can be extrapolated.
2. Compile Time Based on the patterns, a compiler performs loop
splitting so that at run time, the loop iterations can be divided
into groups where each memory access in the loop is tagged for
either LRU or MRU eviction (MRU if the patterns suggest it
would not be cached under OPT).
3. Run Time At run time, based on the cache size and the input, a Pacman-optimized loop determines the actual division of
loop splitting to guide the cache with the LRU/MRU hints to
approach OPT cache management.
[Figure 2 diagram labels: Training Inputs X, Y; OPT Profiling and Pattern Analysis; (Loop) Code; Actual Input Z; Caching on a Processor With LRU/MRU Hints; (Ideally) Optimal Cache Utilization]
Figure 2. Overview of the Pacman system with OPT pattern profiling, LRU/MRU loop splitting, and collaborative caching
Rationing the Shared Cache
For the purpose of applying Pacman to multithreaded programs, we
define a new policy for shared caches: cache rationing. The cache
ration is the size of cache that Pacman assumes for each program in
order to supply hints. As far as the hints are concerned, the cache
is partitioned, but in practice the cache is shared. By promoting
moderation in this way, cache rationing adds the safety of cache
partitioning to the benefit of cache sharing.
For example, for a cache of size 10 and two co-run programs, the
rationing might be 2:3, so that the Pacman threshold for Program A
is 4 and for Program B it is 6. Although the two programs read and
write on the same cache, each block of Program A with a predicted
forward OPT stack distance over 4 is tagged for MRU eviction;
each block of Program B with a predicted OPT stack distance over
6 is tagged for MRU eviction.
Cache rationing was chosen over cache partitioning because a
cache “sipping” program was hypothesized to allow its co-run
partner greater cache usage, resulting in fewer
misses, and Figure 3 bears this out.
[Figure 3 plot. Vertical axis: Miss Ratio. Series: Streaming rationed, Streaming partitioned, SOR rationed, SOR partitioned.]
Figure 3 compares Pacman cache rationing to Pacman cache
partitioning with a streaming application and an implementation
of successive over-relaxation (SOR), which has been used to solve
linear systems of equations. The program traces were interleaved
with 4 SOR references for each Streaming reference, but the results
are nearly identical for 1:1 or 1:4 interleaving. The cache was
rationed between SOR and Streaming from 10:90 to 90:10.
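The interleaving of two traces at a fixed ratio can be sketched as follows. This is a hypothetical helper, not the tool used for the experiment: the function name `interleave`, the placeholder reference labels, and the choice to emit the SOR references first within each group are assumptions.

```python
from itertools import islice

def interleave(major, minor, ratio):
    """Yield `ratio` references from `major`, then one from `minor`, repeatedly.

    When one trace runs out, the remaining references of the other are
    emitted unchanged (an assumed policy for unequal lengths).
    """
    it_major, it_minor = iter(major), iter(minor)
    done = object()  # sentinel marking an exhausted minor trace
    while True:
        chunk = list(islice(it_major, ratio))
        yield from chunk
        ref = next(it_minor, done)
        if ref is not done:
            yield ref
        elif not chunk:
            return  # both traces are exhausted

# Placeholder references: 8 from SOR and 2 from Streaming, interleaved 4:1.
sor = ["S0", "S1", "S2", "S3", "S4", "S5", "S6", "S7"]
stream = ["T0", "T1"]
mixed = list(interleave(sor, stream, 4))
print(mixed)
```

Passing a ratio of 1 gives the 1:1 interleaving, and swapping the two trace arguments gives one SOR reference for every four Streaming references.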
In all tests, rationing shows only very modest improvement
over partitioning for Streaming. For SOR, however, miss ratios are
nearly halved for some cache sizes, almost to the miss ratio for
having the whole cache. In this range of cache sizes, Pacman with
rationing is approximately as effective as LRU with partitioning,
but in cases where Pacman outperforms LRU (see [2]) Pacman will
likely outperform LRU with either partitioning or sharing.
This example demonstrates the two potential advantages of Pacman cache rationing to manage program co-run: safety and sharing.
Co-run programs will not encroach on each other’s rations, but a
program can still utilize the available cache space when it is not
being used or rationed by other programs.
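The arithmetic of the earlier rationing example (a cache of size 10 rationed 2:3) can be written out as a short sketch. The function names, the use of floor division when the ration does not divide evenly, and the sample predicted distances are assumptions made for illustration.

```python
def ration_thresholds(cache_size, ratios):
    """Split cache_size in proportion to ratios (floor division is an assumption)."""
    total = sum(ratios)
    return [cache_size * r // total for r in ratios]

def tag_accesses(predicted_opt_distances, threshold):
    """Tag an access MRU if its predicted OPT distance exceeds the ration, else LRU."""
    return ["MRU" if d > threshold else "LRU" for d in predicted_opt_distances]

# A cache of size 10 rationed 2:3 between Program A and Program B.
threshold_a, threshold_b = ration_thresholds(10, [2, 3])
print(threshold_a, threshold_b)

# Hypothetical predicted OPT distances for four accesses of each program.
distances = [1, 4, 5, 9]
tags_a = tag_accesses(distances, threshold_a)
tags_b = tag_accesses(distances, threshold_b)
print(tags_a)
print(tags_b)
```

The same access pattern receives more MRU tags under the smaller ration. The hints alone moderate each program's cache use, and the hardware still shares the whole cache.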
[Figure 3 plot. Horizontal axis: Cache Ration (MB).]
Figure 3. Pacman miss ratios for cache rationed and cache partitioned runs of Streaming and SOR. Partitioned-cache points are
plotted for cache sizes of 0.125MB, 0.25MB, 0.5MB, and 1MB.
For rationed-cache points, the horizontal axis represents the cache
ration as a percentage of the total 1MB cache. The same traces were
used for training and testing.
