Cache Tuning Survey: Power & Energy Efficiency

A Survey on Cache Tuning from a Power/Energy Perspective
WEI ZANG and ANN GORDON-ROSS, University of Florida
Low power and/or energy consumption is a requirement not only in embedded systems that run on batteries
or have limited cooling capabilities, but also in desktops and mainframes where chips require costly cooling
techniques. Since the cache subsystem is typically the most power/energy-consuming subsystem, caches
are good candidates for power/energy optimizations, and therefore, cache tuning techniques are widely
researched. This survey focuses on state-of-the-art offline static and online dynamic cache tuning techniques
and summarizes the techniques’ attributes, major challenges, and potential research trends to inspire novel
ideas and future research avenues.
Categories and Subject Descriptors: A.1 [General Literature]: Introductory and Survey; B.3.2 [Memory
Structures]: Design Styles—Cache memories; B.3.3 [Memory Structures]: Performance Analysis and
Design Aids
General Terms: Design, Algorithms, Performance
Additional Key Words and Phrases: Cache tuning, cache partitioning, cache configuration, power saving,
energy saving
ACM Reference Format:
Zang, W. and Gordon-Ross, A. 2013. A survey on cache tuning from a Power/energy perspective. ACM Comput.
Surv. 45, 3, Article 32 (June 2013), 49 pages.
DOI: http://dx.doi.org/10.1145/2480741.2480749
In addition to continuously advancing chip technology, today’s processors are evolving
towards more versatile and powerful designs, including high-speed microprocessors,
multicore architectures, and systems-on-a-chip (SoCs). These trends have resulted
in rapidly increasing clock frequency and functionality, as well as increasing energy
consumption. Unfortunately, high energy consumption can exacerbate design concerns,
such as reducing chip reliability, diminishing battery life, and requiring high-cost packaging and cooling techniques. Therefore, low-power processor design is essential for
continuing advancements in all computing domains (e.g., personal computers, servers,
battery-operated portable devices, etc.).
Among all processor components, the cache and memory subsystem generally consume a large portion of the total microprocessor system power. For example, the ARM
920T microprocessor cache subsystem consumes 44% of the total power [Segars 2001].
Even though the StrongARM SA-110 processor specifically targets low-power applications, the processor’s instruction cache consumes 27% of the total power [Montanaro
et al. 1997]. The 21164 DEC Alpha’s cache subsystem consumes 25% to 30% of the
total power [Edmondson et al. 1995]. In real-time signal processing embedded applications, memory traffic between the application-specific integrated circuit (ASIC) and the
off-chip memories constitutes 50% to 80% of the total power [Shiue and Chakrabarti
2001]. These power consumption statistics indicate that the cache is an ideal candidate
for power/energy reductions.
Authors' addresses: W. Zang and A. Gordon-Ross, Department of Electrical and Computer Engineering, University of Florida
In this survey, we focus on cache power/energy-consumption reduction techniques.
Power is consumed in two fundamental ways: statically and dynamically. Static power
consumption is due to leakage current, which is present even when the circuit is not
switching, whereas dynamic power consumption is due to the switching activities of
capacitive load charging and discharging on transistor gates in the circuit. Since the
cache typically occupies a significant fraction of the total on-chip area (e.g., 60% in the
StrongARM [Montanaro et al. 1997]), scaling down the cache size or putting a portion
of the cache into a low leakage mode potentially provides the greatest opportunity
in static power reduction. Alternatively, since the internal switching activity due to
cache accesses (reads/writes) and main memory accesses (fetching data to the cache)
largely dictates dynamic power consumption, reducing the cache miss rate provides the
greatest opportunity in dynamic power reduction. In addition, reducing the cache miss
rate reduces the application’s total number of execution cycles (improved performance),
which also reduces the static energy consumption. Therefore, cache optimization techniques that both reduce the cache size and miss rate are critical in reducing both static
and dynamic system power.
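The relationship between miss rate, execution cycles, and the two energy components can be illustrated with a minimal sketch. The model below is an assumption made for illustration, not a formula given in this survey: dynamic energy is charged per hit and per miss, and static energy is charged per execution cycle. The class name, field names, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheEnergyParams:
    # All values are hypothetical placeholders, not measured data.
    hit_energy_nj: float          # dynamic energy of one cache hit
    miss_energy_nj: float         # dynamic energy of one miss (memory fetch + refill)
    static_nj_per_cycle: float    # leakage energy burned every cycle
    hit_cycles: int               # cycles for an access that hits
    miss_penalty_cycles: int      # extra stall cycles added by each miss


def energy_breakdown(accesses, misses, p):
    """Return (dynamic, static, total) energy in nJ for a simple additive model."""
    hits = accesses - misses
    dynamic = hits * p.hit_energy_nj + misses * p.miss_energy_nj
    # Every access pays the hit latency; each miss adds the stall penalty.
    cycles = accesses * p.hit_cycles + misses * p.miss_penalty_cycles
    static = cycles * p.static_nj_per_cycle
    return dynamic, static, dynamic + static


params = CacheEnergyParams(0.5, 20.0, 0.1, 1, 40)
high_miss = energy_breakdown(1000, 100, params)
low_miss = energy_breakdown(1000, 50, params)
print("100 misses:", high_miss)
print(" 50 misses:", low_miss)
```

Under these assumed values, halving the miss count lowers the dynamic term (fewer memory fetches) and also the static term (fewer stall cycles during which leakage accumulates), which mirrors the argument in the paragraph above.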
The main cache parameters that dictate a cache’s energy consumption are the total
size, block size, and associativity [Zhang et al. 2004]. Application-specific requirements
and behavior dictate the energy consumption with respect to these parameters. Different applications exhibit widely varying cache parameter requirements [Zhang et al.
2004]. For example, the cache size should reflect an application’s or phase’s working set
size. If the cache size is larger than the working set size, excess dynamic energy is
consumed by fetching blocks from an excessively large cache, and excess static energy
is consumed to power the large cache size. Alternatively, if the cache size is smaller
than the working set size, excess energy is wasted due to thrashing (portions of the
working set are continually swapped into and out of the cache due to capacity misses).
Spatial locality dictates the appropriate cache block size. If the block size is too large,
fetching unused information from main memory wastes energy, and if the block size
is too small, high cache miss rates waste energy due to frequent memory fetching
and added stall cycles. Similarly, temporal locality dictates the appropriate cache associativity. Adjusting cache parameters based on applications is appropriate for small
applications with little dynamic behavior, where a single cache configuration can capture the entire application’s temporal and spatial locality behaviors. However, since
many larger applications show considerable variation in cache requirements throughout execution [Sherwood et al. 2003b], a single cache configuration cannot capture all
temporal and spatial locality behaviors. Phase-specified cache optimization allows for
cache parameters and organization to be configured for each execution phase (a phase
is the set of intervals within an application’s execution that have similar cache behavior [Sherwood et al. 2003b]), resulting in more energy savings than application-based
cache tuning [Gordon-Ross et al. 2008].
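The effect of a cache that is smaller than the working set can be sketched with a small experiment. The code below is an illustrative assumption rather than a technique from the surveyed works: it models a fully associative cache with LRU replacement and replays a hypothetical trace that sweeps cyclically over a working set of eight blocks. The function name and trace are invented for this example.

```python
from collections import OrderedDict


def lru_misses(block_trace, capacity_blocks):
    """Count misses for a fully associative LRU cache holding `capacity_blocks` blocks."""
    cache = OrderedDict()  # keys are block numbers, ordered from LRU to MRU
    misses = 0
    for block in block_trace:
        if block in cache:
            cache.move_to_end(block)       # refresh recency on a hit
        else:
            misses += 1
            if len(cache) >= capacity_blocks:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = None
    return misses


# Hypothetical trace: a working set of 8 blocks swept 100 times.
trace = list(range(8)) * 100

for capacity in (7, 8, 16):
    print(capacity, "blocks ->", lru_misses(trace, capacity), "misses")
```

With this trace, a capacity of 8 or 16 blocks incurs only the 8 cold misses, so the extra capacity of the 16-block cache buys nothing, while a capacity of 7 blocks misses on every one of the 800 accesses because each block is evicted just before it is reused.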
In multicore cache architectures, the cores’ cache parameters collectively dictate
the total energy consumption. In heterogeneous multicore architectures, each core can
have different cache parameters, and in homogeneous multicore architectures, each
core has the same cache parameters. In multicore architectures where an application
is partitioned across the cores (as opposed to all cores running disjoint applications),
the cores’ caches directly influence each other, since the cores may have address/data
ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013.
sharing interference and contention for shared cache resources. In addition to the
three main cache parameters, the cache organization in multicore architectures also
impacts the system energy consumption. The cache organization is dictated by cache
hierarchies, and in each hierarchy, the cores may have shared caches, private caches,
and/or a hybrid of shared and private caches.
Cache tuning is the process of determining the best cache configuration (specific
cache parameter values and organization) in the design space (the collection of all
possible cache configurations) for a particular application (application-based cache
tuning) [Gordon-Ross et al. 2004; Zhang et al. 2004] or an application’s phases (phase-based cache tuning) [Gordon-Ross et al. 2008; Sherwood et al. 2003b]. Cache tuning
determines the best cache configuration with respect to one or more design metrics, such
as minimizing the leakage power, minimizing the energy per access, minimizing the
average cache access time, minimizing the total energy consumption, minimizing the
number of cache misses, and maximizing performance, etc. Since high energy/power
consumption is becoming a key performance-limiting factor, this survey focuses on
cache tuning at the architectural level with respect to power/energy savings; however,
cache tuning techniques that focus on system performance optimization share similar
aspects. Cache tuning techniques estimate the cache performance (e.g., miss rates or
access delays) and energy consumption for the configurations in the design space and
determine the optimal cache configuration (or best configuration if determining the
optimal is infeasible) with respect to the design metrics. The cache tuning time is the
total time required to evaluate all of the cache configurations or a subset of the cache
configurations in the design space to determine the optimal/best cache configuration.
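Viewed abstractly, cache tuning is a search over the design space for the configuration that minimizes a chosen metric. The sketch below shows exhaustive exploration under stated assumptions: the parameter ranges are arbitrary, and `toy_energy` is a made-up stand-in for whatever simulator or analytical model would supply the real estimate. None of these names come from the surveyed works.

```python
from itertools import product


def design_space(sizes, block_sizes, associativities):
    """Yield every (size, block, ways) tuple that forms a whole number of sets."""
    for size, block, ways in product(sizes, block_sizes, associativities):
        if size % (block * ways) == 0:
            yield (size, block, ways)


def tune(configs, cost):
    """Evaluate every configuration and return (best_config, best_cost, evaluated_count)."""
    evaluated = [(cost(cfg), cfg) for cfg in configs]
    best_cost, best_cfg = min(evaluated)
    return best_cfg, best_cost, len(evaluated)


def toy_energy(cfg):
    """Hypothetical cost model: purely illustrative, not derived from any measurement."""
    size, block, ways = cfg
    miss_term = 4096 / size * 100    # larger caches miss less
    static_term = size / 1024 * 5    # larger caches leak more
    way_term = ways * 2              # more ways cost more per access
    block_term = abs(block - 32) / 8 # pretend 32-byte blocks match the locality
    return miss_term + static_term + way_term + block_term


space = list(design_space([2048, 4096, 8192], [16, 32, 64], [1, 2, 4]))
best_cfg, best_cost, count = tune(space, toy_energy)
print(best_cfg, best_cost, count)
```

The tuning time in this exhaustive form grows with the number of configurations evaluated, which is why the cost of each individual evaluation matters so much in practice.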
For systems with architectural support for executing multiple applications concurrently, the cache parameters’ dependency on all of the concurrent applications’
characteristics becomes intricate, thereby complicating cache tuning. In simultaneous multithreaded (SMT) processors, fine-grained resource sharing requires all concurrently executing threads to compete for hardware resources, such as the instruction
queue, renaming registers, caches, and translation look-aside buffers (TLBs). In chip
multiprocessor (CMP) systems, only the last-level caches, memories, and communication subsystems are shared. In general, applications running on a CMP system have
less interaction (cache coherence and data dependency) than applications running on
an SMT processor. Systems that combine SMT and CMP further exacerbate cache
tuning complexity.
Even though cache tuning is a well-studied optimization technique, there is no
comprehensive cache tuning survey. Venkatachalam and Franz [2005] reviewed
power reduction techniques for microprocessor systems, which ranged broadly from
circuit-level to application-level techniques and included only a brief discussion of
cache tuning. Inoue et al. [2001] surveyed architectural techniques for caches from
high-performance and low-power perspectives. Uhlig and Mudge [1997] surveyed
cache tuning techniques that leveraged trace-driven cache simulation. That survey
elaborated on the three basic trace-driven cache simulation steps: trace collection,
trace reduction, and trace processing. We augment that survey with recent advances
in trace-driven cache simulation in our discussion of specialized trace-driven cache simulation. In addition, several technical
papers [Dhodapkar and Smith 2003; Meng et al. 2005; Zhang et al. 2004; Zhou et al.
2003] provided thorough overviews on one or more specific cache tuning techniques.
Our survey distinguishes itself from previous surveys in that our survey provides
both an up-to-date and comprehensive cache tuning review and is, to the best of
our knowledge, the first such survey. We present comprehensive summaries, classify
existing works, and outline each cache tuning technique’s challenges with the intent
of inspiring potential research directions and solutions. Our survey contains several
overlapping topics that have been discussed in some previous reviews with respect to
computer architecture performance evaluation techniques [Eeckhout 2010], computer
architecture techniques for power-efficiency [Kaxiras and Martonosi 2008], and
multicore cache hierarchies [Balasubramonian 2011]; however, our survey differs from
these reviews in that our survey focuses on cache tuning.
Since the cache tuning scope is very broad, we outline the remainder of this survey’s
structure based on the following cache tuning classification: design-time offline static
cache tuning and run-time online dynamic cache tuning. Offline static cache tuning
is discussed in Section 2 with hardware support for core-based processors that tune
cache parameters at design time in Section 2.1 and offline cache tuning techniques
in Section 2.2. Online dynamic cache tuning is discussed in Section 3. Circuit-level
techniques for reducing leakage power in configurable architectures are reviewed in
Section 3.1. Configurable cache architectures in conjunction with dynamic cache parameter adjusting techniques for single-core architectures are reviewed in Section 3.2,
and Section 3.3 summarizes cache tuning techniques for multicore architectures. In
phase-based cache tuning, detecting the phase changes, that is, any change in the executing application such that a different cache configuration would be better than the
previous cache configuration [Gordon-Ross et al. 2008], is another interesting topic in
dynamic cache tuning, thus we discuss online and offline techniques to detect phase
changes in Section 3.4. Since various techniques in cache tuning classifications in each
section have their own merits, limitations, and challenges, we will elaborate on each of
these within each section. Finally, Section 4 reviews architectural-level power/energy
quantitative techniques and estimation tools, and Section 5 concludes our survey and
summarizes future directions and challenges with respect to cache tuning.
2. Offline Static Cache Tuning
Cache tuning performed at design time is classified as offline static cache tuning. In
static cache tuning, designers evaluate an application or system during design time to
determine the cache requirements and optimal cache configuration. During runtime,
the system fixes the cache configuration to the offline-determined optimal cache configuration, and the cache configuration does not change during execution; therefore, there
is no runtime overhead with respect to online design space exploration. Since static
cache tuning determines cache parameters prior to system runtime, the actual runtime behavior with respect to dynamically changing system inputs and environmental
stimuli cannot be captured. Offline static cache tuning is suitable for stable systems
with predictable inputs and execution behavior.
We discuss core-based processors with hardware support for offline static cache tuning in Section 2.1. In Section 2.2, we review offline static cache tuning techniques using
simulators and analytical models. Finally, we summarize the offline static cache tuning
techniques in Section 2.3.
2.1. Hardware Support: Core-Based Processors
As reusable modules, intellectual property (IP) cores have a profound impact on system
design. Core-based processors can provide tuning flexibility, such as configurable cache
parameters, to a myriad of designers, enabling all designers to leverage and benefit
from a single IP core design. Therefore, configurable caches are readily available to
designers in core-based processors.
A soft-core processor model consists of a synthesizable HDL (hardware description
language) description that allows a designer to choose a particular cache configuration.
After a designer customizes the processor model, the soft-core model is synthesized for
the final platform.
Cache parameters can be specified during design time for many commercial soft-core processors. In the MIPS32 4K™ [MIPS32 2001] processor cores, the data and
instruction cache controllers support caches of various sizes, organizations, and set
associativities. In addition to an on-chip variable-sized primary cache, the MIPS R4000
[1994] processor cores have an optional external secondary cache that varies in size
from 128 Kbytes to 4 Mbytes. Both caches provide variable cache block size. Some
ARM processor cores,¹ such as the ARM 9 and ARM 11 families, provide configurable
caches with total size ranging from 4 Kbytes to 128 Kbytes. ARC configurable cores,²
such as the ARC 625D, ARC 710D, and ARC 750D, enable design-time configurability
of instruction and data cache sizes. Tensilica’s Xtensa Configurable Processors³ offer
multiple options for cache size and associativity.
Alternatively, some hard-core processors [Gordon-Ross et al. 2007; Malik et al. 2000;
Zhang et al. 2004] support offline cache tuning by providing runtime configurable
cache ways and/or sets (i.e., ways/sets can be selectively enabled/disabled) and a
configurable block size. These hard-core processors enable the system to self-configure
the cache to the designer-specified cache parameters at system startup. Since these
processors can also be leveraged in runtime cache tuning, we elaborate on these
processors in Section 3.2.
2.2. Offline Cache Tuning Techniques
Numerous techniques exist for offline static cache tuning, including techniques that
directly simulate the cache behavior using a simulator (simulation-based cache tuning,
Section 2.2.1) and techniques that formulate cache miss rates with mathematical models according to theoretical analysis (analytical modeling, Section 2.2.2). However, all
techniques provide the same basic output: the estimated or actual cache miss rates for
the cache configurations in the design space. Using these cache miss rates, designers
determine the optimal cache configuration based on the desired design metrics, such
as the lowest energy cache configuration.
2.2.1. Simulation-Based Cache Tuning. Simulation-based tuning is a common technique
not only for tuning the cache, but also for tuning any design parameter in the processor.
Cache simulation leverages software to model cache operations and estimates cache
miss rates or other metrics based on a representative application input. The simulator
allows cache parameters to be varied such that different cache configurations in the
design space can be simulated easily without building costly physical hardware prototypes. However, the simulation model can be cumbersome to set up, since accurately
modeling a system’s environment can be more difficult than modeling the system itself.
Additionally, since cache simulation is generally slow, simulating only seconds of application execution can require hours or days of simulation time for each cache configuration in the design space. Due to this lengthy simulation time, simulation-based cache
tuning is generally not feasible for large design spaces and/or complex applications.
In this section, we review simulators with respect to three different classification
criteria: accuracy-level classification classifies simulators as functional simulation or
timing simulation; simulation input classification classifies simulators
as trace/event-driven simulation or execution-driven simulation; and
simulation coverage classification classifies simulators as microarchitecture simulators
or full-system simulators. Microarchitecture and full-system simulators can provide functional accuracy or cycle-level accuracy and use trace/event-driven
or execution-driven inputs. Aside from general trace-driven simulation for all modules
in the microarchitecture, specialized trace-driven cache simulation,
which only simulates the cache module, is widely used for cache tuning. Finally, we
review simulator acceleration techniques and cache tuning acceleration using efficient
design space exploration.
¹ http://www.arm.com/products/processors/.
² http://www.synopsys.com/IP/ConfigurableCores/ARCProcessors/Pages/Default.aspx.
³ http://tensilica.com/products/xtensa-customizable/configurable.htm.
Functional and Timing Simulation.
Functional Simulation. Functional simulation is also referred to as instruction-set
emulation, since functional simulation only provides the functional characteristics of
the instruction set architecture (ISA) without timing estimates. Therefore, functional
simulation is typically used for functional correctness validation. Cache hits/misses
can be generated from the functional simulation; however, since the timing-related
execution is not available and only the user-level code is simulated, functional simulation lacks accuracy. The advantage of functional simulation is fast simulation time.
Functional simulation can also be used to generate a functionally-accurate memory
access trace for trace-driven cache simulation.
The SimpleScalar Tool Set [Austin et al. 2002] is widely used in academia and industry for single-core processor simulation and contains two microarchitecture functional
simulators: sim-safe and sim-fast. SimpleScalar is available in several processor variations, such as the Alpha, PISA, ARM, and x86 instruction sets.
Timing Simulation. Timing simulation is typically performed in conjunction with
functional simulation. Timing simulation models architecture internals, such as
the pipeline, branch predictor, detailed cache and memory access latencies, etc.
and outputs timing-related statistics, such as cycles per instruction (CPI). Timing
simulation generates more accurate cache hit/miss rates than functional simulation,
since the timing-dependent speculative execution, out-of-order execution, and the interthread/intercore contention for the shared resources are simulated correctly. However, timing simulation requires more simulator development time and simulation time than functional simulation.
Trace/Event-Driven and Execution-Driven Simulation.
Trace/Event-Driven. Trace/event-driven simulation decomposes the complete simulation into the functional simulation and timing simulation. During trace collection,
a functional simulator outputs the application’s instruction or execution-event trace,
which serves as input to a detailed timing simulation. In trace/event-driven simulation,
the functional simulation is performed only once, and the detailed timing simulation
for only the events of interest (i.e., the events related to interesting system aspects
that need to be evaluated) is performed many times to detect the correct occurrence
time of the traced instructions/events for different microarchitectures (with different
configurations). Therefore, the total simulation time for trace/event-driven simulation
is much less than that of timing simulation.
In addition to using a functional simulator, the traces can be collected with direct
application execution by running an instrumented binary on a host machine. Execution
on real hardware is much faster than functional simulation; however, the simulation
target architecture and ISA must be the same as the host architecture and ISA.
Existing instrumentation tools include Atom [Srivastava and Eustace 1994], Shade
[Cmelik and Keppel 1994], Pin [Luk et al. 2005], Graphite [Miller et al. 2010], and
CMP$IM [Jaleel et al. 2008a]. Specifically, CMP$IM is developed based on Pin for
multicore cache simulation.
Although trace/event-driven simulation is intuitively simple, this simulation
technique has several challenges. For example, trace/event collection can be difficult
for complex systems, such as systems with multiple processors, concurrent tasks,
or dynamically-linked/dynamically-compiled code. Since these traces are static and
reflect a single execution path, it is challenging for trace/event-driven simulation to
capture the dynamic behavior of system execution. If trace/event collection is not
accurate, trace simulation results may be inaccurate. Additionally, traces are typically
large (tens to hundreds of gigabytes), thus requiring significant storage space and
lengthy trace processing times. Previous works proposed trace sampling [Conte et al.
1998] and trace compression [Janapsatya et al. 2007] techniques to reduce the trace
size without excessive loss to the trace’s completeness and accuracy.
Rico et al. [2011] proposed a trace-driven simulation technique for multithreaded applications wherein trace-driven simulation was combined with the dynamic execution
of parallelism-management operations to track the dynamic thread scheduling and interthread synchronizations. Lee et al. [2010] developed tsim, which used a two-phase,
trace-driven simulation approach for fast multicore architecture simulation. To simulate an out-of-order processor, the timing error overhead was estimated during trace
collection in order to correct the timing errors in trace simulation. The authors also
proposed directly evaluating the trace changes along with the performance (e.g., cache
misses) during trace processing. To simulate the synchronizations, synchronization
primitives were recorded in the trace files to model resource contentions.
Execution-Driven Simulation. Execution-driven simulation tightly integrates the
functional and timing simulations. In execution-driven simulation, a simulator executes the application binary and additionally simulates the architecture’s behavior.
Unlike trace-driven simulation with large, static trace files, execution-driven simulation simulates dynamically-changing application execution, such as speculated executions, out-of-order executions, multithreaded and multicore accesses to shared caches,
and different execution paths based on different inputs. The trade-off for these additional details as compared to trace-driven simulation is significantly longer simulation time.
Execution-driven simulators are the most prevalent simulation tool used in previous
works. Many execution-driven simulators have been designed, including SimpleScalar,
M5 [Binkert et al. 2006], GEMS [Martin et al. 2005], Gem5,⁴ SESC [Ortego and Sack
2004], RSIM [Hughes et al. 2002], Flexus [Wenisch et al. 2006], PTLSim [Yourst
2007], etc. The most widely used execution-driven simulator for single processors is
SimpleScalar, in which sim-cache (which is specially designed for cache simulation)
provides many cache organization options. Gem5 (SE-mode), which is the combination of M5 and GEMS, supports execution-driven simulation for multicore microarchitectures. The default cache coherence is an abstract snooping protocol. Even though
the cache hierarchy is easily configured and the coherence protocol can automatically
extend to any memory hierarchy, the cache coherence protocol is difficult to modify.
SESC is a simulator used to simulate a superscalar processor. The processor core uses
execution-driven simulation, while the remainder of the architecture uses event-driven
simulation, and events are called and scheduled as needed. Although SESC can simulate CMP architectures, the multicore operations are actually emulated using multiple
threads. Therefore, SESC cannot simulate architectures that combine SMT and CMP.
Currently, SESC only supports the MIPS ISA.
Microarchitecture and Full-System Simulation.
Microarchitecture Simulation. In general, microarchitecture simulators only simulate the user-level code and do not simulate the operating system (OS)-level code.
However, simulating only user-level code is not sufficient when the OS code has significant impacts on the execution (e.g., system calls, interrupts, input/output (I/O) events,
etc.). Some microarchitecture simulators use a proxy call mechanism by invoking the
host OS to emulate the changes in register state and memory state due to system calls;
however, this OS emulation lacks flexibility and is not as accurate as direct OS-level
code simulation.
⁴ The Gem5 Simulator. http://gem5.org.
Full-System Simulation. Full-system simulators provide the highest fidelity to the
actual system, as compared to all other simulators, due to the full-system simulator’s
large-scale simulation coverage, which includes both user- and OS-level code and I/O
devices, such as Ethernet and disks. OS code simulation is especially critical for multithreaded and multicore architectures, since OS scheduling directly affects these architectures’ execution, and thus affects cache behavior. Full-system simulation acts as a
system virtual machine that includes all of the I/O, OS, memory, and other device activities. Therefore, full-system simulation covers more features of the target architecture
than microarchitecture simulation but requires extremely lengthy simulation time.
SimOS [Rosenblum et al. 1997] is a full-system simulator developed by Stanford
University. SimOS uses a simple cache-coherent, nonuniform memory access (CC-NUMA) memory system with a directory-based cache-coherence policy. Virtutech Simics [Magnusson et al. 2002] is a well-maintained, commercially available full-system
functional simulator that supports many instruction sets, such as Alpha, PowerPC, x86,
AMD64, MIPS, ARM, and SPARC, and device models, such as graphics cards, Ethernet
cards, and PCI (peripheral component interconnect) controllers. Users can customize
these modules to simulate the target processor’s details. Simics is very flexible and
supports heterogeneous platforms with various processor architectures and operating
systems. However, Simics does not provide cycle-accurate simulation, and not all of the
source code is readily available. Another widely used full-system simulator is Gem5
(FS-mode). Gem5 provides cycle-level precise evaluation of the memory hierarchy and
interconnection, and supports the same architecture families as Simics. Gem5 is flexible and convenient for cache-based research, since SLICC (Specification Language for
Implementing Cache Coherence) allows users to customize cache coherence protocols
from directory- to snooping-based. However, simulation that includes a SLICC-defined
memory system increases the simulation time, and extending the SLICC protocol to
other cache hierarchies (i.e., the protocol is only employed for a specific cache hierarchy) is difficult. Other full-system simulators include M5, QEMU [Bellard 2005],
Embra [Witchell and Rosenblum 1996], AMD’s SimNow [Bedichek 2004], and Bochs
[Mihocka and Schwartsman 2008].
Specialized Trace-Driven Cache Simulation. For specialized trace-driven
cache simulation, cache behavior can be simulated using a sequence of time-ordered
memory accesses, typically referred to as an access trace. The access trace can be collected using any of the aforementioned simulators depending on the desired accuracy,
and therefore, the access trace may contain only user-level addresses or both user-level
and system-level addresses. Unlike general trace-driven simulation for a microarchitecture, trace-driven cache simulation only simulates the cache module. Since simulating
the entire processor/system is slow, trace-driven cache simulation can significantly reduce cache tuning time by simulating an application only once to produce the access
trace and then process that access trace quickly to evaluate the cache configurations.
Trace-driven cache simulation shares the same limitations as general trace-driven
simulation. For example, speculative execution is difficult to simulate accurately, since
trace collection typically does not collect the wrong-path execution, and since the access trace is statically generated for fixed inputs, the evaluated cache performance
may not apply to other input sets. In addition, modeling dynamic execution in real
systems is challenging, since the cache access trace is fixed with a particular timing order. For in-order single-processor simulation, the memory access trace does not
change as the cache architecture changes. Therefore, there is no timing dependency
for trace-driven cache simulation results. For out-of-order processors with dynamically
scheduled instructions, the memory access trace changes based on the different cache
misses and cache/memory access latencies. In previous works, Lee et al. [2009] and
Lee and Cho [2011] predicted the impact of different cache configurations on the cache
misses and approximated superscalar processor performance from in-order generated
traces. In a multithreaded/multicore architecture, the trace is typically timing- and
cache architecture-dependent. Goldschmidt and Hennessey [1992] investigated these
dynamic effects in multithreaded and multicore simulations. For shared caches, the
interleaved access order from different threads/cores changed across different cache
configurations due to the changes in the cache miss addresses and related latencies for
each individual thread. Additional factors, such as dynamic scheduling, barrier synchronization, and cache coherence, may also produce different timing-dependent traces
for different cache configurations. In developing trace-driven cache simulation for multithreaded/multicore architectures, all of these timing dependencies must be evaluated.
Even though the cache hit/miss rates are simulated correctly using trace-driven
cache simulation, the overall system energy consumption is difficult to evaluate.
When evaluating the energy consumption, other factors should be considered, such
as techniques used to hide cache miss latencies and the contention for other critical
shared resources in multithreaded/multicore architectures, which are not available in
trace-driven cache simulation.
Sequential Trace-Driven Cache Simulation. A simple naïve technique for leveraging
trace-driven cache simulation for cache tuning is to sequentially process the access
trace for each cache configuration, where the number of trace processing passes is
equal to the number of cache configurations in the design space. This sequential
trace-driven cache simulation technique can result in prohibitively lengthy simulation
time for a large access trace, thus requiring lengthy tuning time for exhaustive design
space exploration.
Dinero [Edler and Hill 1998] is the most widely used trace simulator for single-processor systems. Dinero models each cache set as a linked list where the number of
nodes in each list is equal to the cache set associativity, and each node records the tag
information for the addresses that map to that set. Cache operations are modeled as
linked list manipulations. For each trace address, Dinero computes the set index and
tag according to the cache configuration’s parameters and then checks the corresponding linked list to determine if that access is a hit or miss. Dinero then updates the set’s
linked list in accordance with the replacement policy. CASPER [Iyer 2003] provides
trace-driven cache simulation for both single- and multicore architectures. The multicore cache uses the MESI coherence protocol (which is the most common protocol where
every cache line is either in the modified, exclusive, shared, or invalid state) along with
several variations, and the cache hierarchies employ last-level shared caches. Cache
parameters are configurable for each cache level and heterogeneous configurations are
allowed, but the cores’ private cache configurations must be homogeneous. CMP$IM is
a trace-driven multicore cache simulator that processes the trace addresses on-the-fly, thus eliminating the overhead of storing large trace files. CMP$IM employs an
invalidation-based cache coherence protocol. Users can specify the cache parameters,
replacement policy, write policy, number of cache levels, and the levels’ inclusion policy.
Like CASPER, CMP$IM requires homogeneous private caches.
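The per-set bookkeeping described above, together with the one-pass-per-configuration cost of sequential simulation, can be sketched in a few lines of Python. This is a minimal illustration and not Dinero's actual implementation: the function name `simulate_lru`, its parameters, the toy trace, and the small design space are assumptions made for this example, and each set is modeled as an ordered list (most recently used first) with LRU replacement.

```python
from itertools import product


def simulate_lru(trace, num_sets, assoc, block_size):
    """Count misses for one cache configuration over a list of byte addresses.

    Hypothetical helper: each set is a list of tags, most recently used first.
    """
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // block_size          # block address
        index = block % num_sets            # set index
        tag = block // num_sets             # tag stored in the set's list
        ways = sets[index]
        if tag in ways:
            ways.remove(tag)                # hit: unlink before moving to the front
        else:
            misses += 1
            if len(ways) == assoc:
                ways.pop()                  # evict the LRU tag (end of the list)
        ways.insert(0, tag)                 # most recently used goes to the front
    return misses


# Sequential tuning: one full pass over the trace per configuration.
trace = [0, 64, 128, 0, 64, 192, 0, 256, 64, 128]
design_space = list(product([1, 2, 4], [1, 2], [16, 32]))  # (sets, ways, block size)
results = {cfg: simulate_lru(trace, *cfg) for cfg in design_space}
passes = len(results)   # equals the number of configurations in the design space
```

Because every configuration re-reads the whole trace, the total work grows with the product of the trace length and the design space size, which is the cost that single-pass techniques aim to avoid.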
Single-Pass Trace-Driven Cache Simulation. To speed up the total trace processing time as compared to sequential trace-driven cache simulation, single-pass trace-driven cache simulation simulates multiple cache configurations during a single trace-processing pass, thereby reducing the cache tuning time. The trace simulator processes
the trace addresses sequentially and evaluates each processed address to determine if
the access results in a cache hit/miss for all cache configurations simultaneously. For
each processed address, the number of conflicts (i.e., previously accessed blocks that
map to the same cache set as the currently processed address) directly determines
whether or not the processed address is a cache hit/miss.
Fig. 1. Two address-processing cases in trace-driven cache simulation using the stack-based algorithm: case
1 depicts the situation where the processed address is not found in the stack, and case 2 depicts the situation
where the processed address is found in the stack.
Evaluating the number of conflicts is very time consuming because all previously accessed unique addresses must be evaluated for conflicts. Single-pass trace-driven cache
simulation leverages two cache properties—the inclusion property and the set refinement property—to speed up this conflict evaluation. The inclusion property [Mattson
et al. 1970] states that larger caches contain a superset of the blocks present in smaller
caches. If two cache configurations have the same block size, the same number of sets,
and use access order-based replacement policies (e.g., least recently used (LRU)), the
inclusion property indicates that the conflicts for the cache configuration with a smaller
associativity also form the conflicts for the cache configuration with a larger associativity. The set refinement property [Hill and Smith 1989] states that the blocks that map
to the same cache set in larger caches also map to the same cache set in smaller caches.
The set refinement property implies that the conflicts for the cache configuration with
a larger number of sets are also the conflicts for the cache configuration with a smaller
number of sets if both cache configurations have the same block size.
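Both properties can be checked empirically on a small example. The sketch below assumes modulo (bit-selection) set indexing with power-of-two set counts and LRU replacement; the helper names `set_index` and `lru_hits` and the block sequence are hypothetical and chosen only for illustration.

```python
def set_index(block, num_sets):
    """Set index under modulo (bit-selection) mapping; num_sets is a power of two."""
    return block % num_sets


def lru_hits(blocks, num_sets, assoc):
    """Return the positions in `blocks` that hit in an LRU cache (hypothetical helper)."""
    sets = [[] for _ in range(num_sets)]
    hits = set()
    for pos, block in enumerate(blocks):
        ways = sets[set_index(block, num_sets)]
        if block in ways:
            hits.add(pos)
            ways.remove(block)
        elif len(ways) == assoc:
            ways.pop()                      # evict the least recently used block
        ways.insert(0, block)
    return hits


blocks = [0, 4, 0, 8, 4, 0, 1, 1, 5, 1, 8, 0]

# Set refinement: blocks that share a set with 8 sets also share a set with 4 sets.
refinement_holds = all(
    set_index(a, 4) == set_index(b, 4)
    for a in blocks for b in blocks
    if set_index(a, 8) == set_index(b, 8)
)

# Inclusion: with the same set count and LRU, every hit at a lower associativity
# is also a hit at a higher associativity.
inclusion_holds = lru_hits(blocks, 4, 1) <= lru_hits(blocks, 4, 2) <= lru_hits(blocks, 4, 4)
```

A single-pass simulator relies on exactly these subset relations: once an access is known to hit in a smaller configuration, the larger configurations covered by the property need no further conflict evaluation.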
Single-pass trace-driven cache simulation can be divided into two categories based on
the algorithm and data structure used: the stack-based algorithm and the tree/forest-based algorithm.
Early work by Mattson et al. [1970] developed the stack-based algorithm for fully-associative caches, which served as the foundation for all future trace-driven cache
simulation work. Figure 1 depicts two address-processing cases in trace-driven cache
simulation using the stack-based algorithm. Letters A, B, C, D, E, and F represent
different addresses that map to different cache blocks. The trace simulator processes
the access trace one address at a time. For each processed address, the algorithm first
performs a stack search for a previous access to the currently processed address. Case
1 depicts the situation where the currently processed address has not been accessed
before and is not present in any previously accessed cache block (the address is not
present in the stack). Therefore, this access is a compulsory cache miss. Case 2 depicts
the situation where the currently processed address is located in the stack and has
been previously accessed. Conflict evaluation evaluates the potential conflict set (all
addresses in the stack between the stack’s top and the currently processed address’s
previous access location in the stack) to determine the conflicts. The number of conflicts
directly determines the minimum associativity (i.e., cache size for a fully-associative
cache) necessary to result in a cache hit for the currently processed address. After
conflict evaluation, the stack update process updates the stack’s stored address access
order by pushing the currently processed address onto the stack and removing the
previous access to the currently processed address if the current access was not a
compulsory miss. Hill and Smith [1989] extended the stack-based algorithm to simulate
direct-mapped and set-associative caches and leveraged the set refinement property to
efficiently determine the number of conflicts for caches with different numbers of sets.
Thompson and Smith [1989] introduced dirty-level analysis and included write-back
counters for write-back caches.
Fig. 2. An example of the tree/forest structure for trace-driven cache simulation using the tree/forest-based
algorithm. The rectangles correspond to tree nodes, and values in the rectangles correspond to the indexes
of the cache sets. (Figure axis: number of sets.)
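The stack search, conflict evaluation, and stack update steps for a fully-associative LRU cache can be sketched as follows. This is an illustrative rendering of the general stack-processing idea rather than the original formulation of Mattson et al.; the function names and the example trace are assumptions. An access with n conflicts needs at least n + 1 blocks of capacity to hit.

```python
def stack_distances(trace):
    """One pass of the stack algorithm over block addresses (illustrative sketch).

    Returns, per access, the number of conflicts found between the top of the
    stack and the previous access, or None for a compulsory miss.
    """
    stack = []            # index 0 is the top of the stack (most recent access)
    distances = []
    for block in trace:
        if block in stack:                     # stack search (case 2)
            depth = stack.index(block)         # conflicts = entries above the old position
            distances.append(depth)
            stack.pop(depth)                   # remove the previous access
        else:
            distances.append(None)             # case 1: compulsory miss
        stack.insert(0, block)                 # stack update: push onto the top
    return distances


def misses_for_sizes(distances, sizes):
    """Fully-associative LRU misses for several cache sizes (in blocks) at once."""
    return {
        size: sum(1 for d in distances if d is None or d >= size)
        for size in sizes
    }


trace = ["A", "B", "C", "A", "D", "B", "A"]
distances = stack_distances(trace)
misses = misses_for_sizes(distances, [1, 2, 3, 4])
```

One pass over the trace yields the miss counts for every cache size, but both the search and the update are linear in the stack depth, which is the cost discussed next.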
Since the time complexity of both the stack search for the processed address and the
conflict evaluation is on the order of the stack size (which in the worst case, is equal to
the number of uniquely accessed addresses), the stack search and conflict evaluation
can be very time consuming. To speed up the trace processing time, much work has
focused on replacing the stack structure with a tree/forest structure that stores and traverses accesses more efficiently than the stack structure by storing the cache contents
for multiple cache configurations in a single structure. The commonly used tree/forest
structure in trace-driven cache simulation uses different tree levels to represent different cache configurations with a different number of cache sets. Figure 2 depicts an
example of this tree/forest structure. In this example, the number of cache sets can be
configured as 2, 4, and 8, which requires two three-level trees. Each rectangle in the
figure represents a single node, where each node stores each cache set’s contents, and
the rectangle’s value corresponds to the cache set’s index address. In this manner, the
number of addresses stored in a node directly indicates the number of conflicts. When
processing a new address, the processed address is searched for in the tree and added
to the tree nodes based on the set that the processed address would map to in each
level. Since the tree/forest structure in Figure 2 can only simulate cache configurations
with a fixed block size, multiple trees/forests are typically used to simulate multiple
block sizes in a single pass [Hill and Smith 1989; Janapsatya et al. 2006].
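The level-per-set-count organization can be sketched with nested dictionaries. The code below is a simplified illustration under stated assumptions: block addresses map to sets by modulo indexing, each node keeps its blocks in most-recently-used-first order, and the names `children`, `process`, and `forest` are hypothetical. It does not reproduce the search-pruning optimizations of the published tree/forest algorithms.

```python
def children(index, num_sets):
    """Set indexes at the next level (2 * num_sets sets) that refine set `index`."""
    return [index, index + num_sets]


def process(forest, set_counts, block):
    """Insert one block address into every level; return conflicts per set count.

    `forest[num_sets][index]` is the node for one cache set, holding block
    addresses in most-recently-used-first order. None marks a first access.
    """
    conflicts = {}
    for num_sets in set_counts:
        node = forest[num_sets].setdefault(block % num_sets, [])
        if block in node:
            conflicts[num_sets] = node.index(block)   # blocks accessed since last use
            node.remove(block)
        else:
            conflicts[num_sets] = None
        node.insert(0, block)
    return conflicts


set_counts = [2, 4, 8]
forest = {n: {} for n in set_counts}
history = [process(forest, set_counts, b) for b in [0, 8, 4, 0, 2, 8]]

# The two roots (sets 0 and 1 of the 2-set level) expand into the 4-set level.
level_4 = [c for root in (0, 1) for c in children(root, 2)]
```

In this example the conflict count for a given access never increases as the number of sets grows, which is the set refinement property at work; an access hits in a configuration with a given set count whenever its conflict count is below the associativity.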
One drawback of the tree/forest structure is redundant storage overhead, because
each unique block address must be stored in one node in each level. Sugumar and Abraham [1991] developed two structures/algorithms to simulate the cache configurations
with either a fixed block size or a fixed cache size while varying the remaining two
parameters. Their novel structures reduce the storage by a factor of two, as compared
to previous tree/forest structures.
Tree/forest-based algorithms have several disadvantages compared to stack-based
algorithms, such as increased complexity that requires complicated processing operations, which make tree/forest-based algorithms not easily amenable to hardware
implementation for runtime cache tuning. Therefore, the stack-based algorithm is still
widely used. Viana et al. [2008] proposed SPCE—a stack-based algorithm that evaluated cache size, block size, and associativity simultaneously using simple bit-wise
operations. Gordon-Ross et al. [2007] designed SPCE’s hardware prototype for runtime
cache tuning and recently extended the stack-based algorithm for exclusive two-level
cache simulation [Zang and Gordon-Ross 2011].
Although single-pass trace-driven cache simulation is a prominent solution for reducing tuning time, several drawbacks limit the single-pass trace-driven cache simulation’s
versatility. Both the stack- and tree/forest-based algorithms restrict the replacement
policy to access order-based replacement policies. Additionally, some single-pass
algorithms that simulate caches with only one or two variable parameters must reexecute the algorithm several times in order to cover the entire design space. Furthermore, single-pass trace-driven cache simulation for multilevel caches is complex.
For example, in a two-level cache hierarchy, the level-one cache filters the access trace
and produces one filtered access trace for each level-two cache (i.e., each unique level-one cache configuration’s misses form a unique filtered access trace for the level-two
cache). When directly leveraging single-pass trace-driven cache simulation for single-level caches to simulate each of the two-level caches, the two-level cache simulation
must store and process each filtered access trace separately to combine each level-two cache configuration with each level-one cache configuration. Since the number of
filtered access traces is dictated by the level-one cache’s design space size, the time and
storage complexities can be considerably large for multilevel cache simulation. Finally,
single-pass trace-driven cache simulation does not support multithreaded/multicore
architectures.
Simulation Acceleration. As today’s processors increase in complexity and
functionality, the simulation time required for detailed cycle-accurate simulation is
becoming infeasible, especially for multithreaded/multicore microarchitectures, full systems, and highly configurable caches. Therefore, multiple techniques have been proposed
to reduce the simulation time.
Instead of simulating the entire application, sampled simulation simulates only a
small fraction of the application—the critical regions. Intervals sampled from the critical regions are simulated to represent the entire application’s execution characteristics.
Challenges include determining the interval length and sampling frequency. The interval length should be long enough to capture the spatial and temporal localities of
the cache accesses, and the sampling frequency must be high enough to capture all
relevant execution changes; however, long intervals and frequent sampling complicate
simulation and increase simulation time. Previous works suggest three techniques for
guiding interval length and sampling frequency selection. Random sampling [Laha
et al. 1988] is the easiest technique and evaluates cache performance at a random
sampling frequency with a fixed interval length, but may provide an inaccurate view of
an application’s entire execution behavior. Periodic sampling, such as SMARTS (Sampling Microarchitecture Simulation) [Wunderlich et al. 2003], selects the simulation
intervals using a fixed period across the entire execution, and the loss in performance
accuracy is quantified using statistical theory. Finally, the most complicated but most accurate technique is to select a representative interval for each execution phase and then
estimate the overall performance using weighted accumulation. SimPoint [Hamerly
et al. 2005] is the most well-known tool for selecting such representative intervals
(Section 3.4.2).
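Two of these ideas, periodic interval selection and weighted accumulation over representative intervals, reduce to short calculations. The sketch below is illustrative only: the function names, the interval parameters, and the per-phase weights and miss rates are hypothetical values, and it does not reproduce the statistical machinery of SMARTS or the phase clustering of SimPoint.

```python
def periodic_intervals(trace_len, interval_len, period):
    """Start/end offsets of fixed-length intervals taken every `period` accesses."""
    return [(start, min(start + interval_len, trace_len))
            for start in range(0, trace_len, period)]


def weighted_estimate(samples):
    """Combine per-phase metrics by weighted accumulation (weights are normalized)."""
    total_weight = sum(weight for weight, _ in samples)
    return sum(weight * metric for weight, metric in samples) / total_weight


intervals = periodic_intervals(trace_len=1000, interval_len=100, period=400)

# Hypothetical phases: (fraction of execution, miss rate of the representative interval)
overall_miss_rate = weighted_estimate([(0.5, 0.02), (0.25, 0.10), (0.25, 0.04)])
```

Here only 300 of 1000 accesses would be simulated in detail under periodic sampling, and the phase-based estimate weights each representative interval by the share of execution its phase covers.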
Another acceleration technique is statistical simulation [Eeckhout et al. 2003; Genbrugge et al. 2006; Joshi et al. 2006], which consists of three steps. The first step is
statistical profiling, which collects statistics about key application characteristics, such
as the statistical instruction types, control flow graph, branch predictability, and cache
behavior. The second step is synthetic trace generation, which produces a short synthetic trace that includes all of the statistical characteristics. The last step is statistical
processor modeling, which simulates the synthetically generated trace. The simulator
model is simple, such that the cache is not explicitly modeled, and instead, the simulator simulates the miss correlation and models the miss latency as delayed hits. Due to
the short synthetic trace and the simplified simulator, there is a significant speedup in
simulation time.
Finally, due to the high computing capability of multicore processors and machine
clusters, parallel computing provides a straightforward technique for speeding up
simulation-based cache tuning. Since each cache configuration can be evaluated independently, the design space can be distributed across multiple processors, and each processor simulates a different cache configuration simultaneously. However, this straightforward distribution does not speed up a single cache configuration’s simulation time,
which can range from hours to days or weeks for complex systems and/or lengthy applications. Therefore, much research focuses on more sophisticated parallel simulators
[Chidester and George 2002; Falcón et al. 2008; Chen et al. 2009], which essentially
partition the simulation work across multiple host threads, and the threads are synchronized every simulated cycle or every few simulated cycles for cycle-accurate simulators. Specifically
for trace-driven cache simulation, Heidelberger and Stone [1990] partitioned the access
trace into non-overlapping segments for parallel simulation. Sugumar [1993] decreased
the stack processing time using a parallel stack search technique. Wan et al. [2009]
developed a GPU (graphics processing unit)-based trace simulator in which different
cache configurations were simulated in different threads in parallel. Additionally, for
each cache configuration simulation, the simulator maintained different cache sets in
parallel and searched the matched tag for the processed address in the cache set with
a large number of cache ways in parallel.
Cache Tuning Acceleration Using Efficient Design Space Exploration. Assuming that the cache performance evaluated using simulation is correct, an exhaustive search always determines the optimal cache configuration, because this technique
exhaustively evaluates all configurations in the design space. However, since design
spaces are typically large, the design exploration time (i.e., cache tuning time) for exhaustive search is prohibitively long. Viana et al. [2006] showed that if intelligently
selected, a small subset of cache configurations would yield effective cache tuning,
resulting in a fraction of the exhaustive design exploration time. Therefore, efficient
design space pruning techniques are critical for simulation-based cache tuning and
trade off accuracy for reduced tuning time by searching only a subset of the design
space. Even though these techniques do not guarantee optimality, careful design can
yield near-optimal results.
Heuristic search techniques, such as Zhang et al.’s [2004] single-level cache tuning
and Gordon-Ross et al.’s [2004, 2009] two-level and unified cache tuning, developed for
online dynamic cache tuning (Section 3.2.3), can be directly used for offline simulation-based cache tuning to prune uninteresting cache configurations from the design space
and significantly speed up design exploration. As an optimization problem, cache tuning can use genetic algorithms to accelerate design space exploration. Since the optimal
cache configuration may be different for each execution phase, offline exhaustive exploration of the design space for each phase is infeasible. To address this problem, Díaz
et al. [2009] divided the execution into intervals with a fixed number of instructions,
and determined the optimal cache configuration for each interval using a genetic algorithm. The genetic algorithm modeled individuals in the population as a sequence
of configurations for all the consecutive intervals, and the evaluation metric was the
overall instructions per cycle (IPC). Given a basic set of configuration sequences, the
genetic algorithm obtained new configuration sequences as the IPC improved. During
runtime, the offline-determined optimal configuration sequence was directly used for
all of the intervals.
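A toy version of this encoding, where each individual is a sequence of per-interval configurations scored by overall IPC, might look as follows. This is a sketch under strong assumptions and not the algorithm of Díaz et al.: the selection, one-point crossover, and mutation operators are generic choices, `ipc_table` holds made-up values, and each interval's IPC is assumed to be independent of the configurations used in other intervals (reconfiguration costs are ignored). Because intervals contain equal instruction counts, the overall IPC is the harmonic mean of the per-interval IPCs.

```python
import random


def evolve(ipc, num_configs, generations=30, pop_size=12, seed=0):
    """Toy genetic search over per-interval configuration sequences.

    `ipc[i][c]` is a hypothetical, precomputed IPC of interval i under
    configuration c; fitness is the overall IPC across all intervals.
    """
    rng = random.Random(seed)
    n = len(ipc)

    def fitness(seq):
        # Equal-instruction intervals: total instructions / total cycles.
        return n / sum(1.0 / ipc[i][c] for i, c in enumerate(seq))

    population = [[rng.randrange(num_configs) for _ in range(n)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]          # keep the fitter half
        offspring = []
        while len(parents) + len(offspring) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)                 # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(n)] = rng.randrange(num_configs)   # mutation
            offspring.append(child)
        population = parents + offspring
    best = max(population, key=fitness)
    return best, fitness(best)


ipc_table = [[1.0, 1.2, 0.9], [0.8, 0.7, 1.1], [1.3, 1.0, 1.0], [0.9, 1.4, 1.2]]
best_sequence, best_ipc = evolve(ipc_table, num_configs=3)
```

Keeping the fitter half of each generation means the best fitness found never decreases, although, as with any such heuristic, reaching the optimal sequence is not guaranteed.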
2.2.2. Analytical Modeling-Based Cache Tuning. Unlike cache simulation techniques that
require lengthy simulation/tuning time, analytical modeling directly calculates cache
misses for each cache configuration using mathematical models. Since the cache miss
rates are calculated nearly instantaneously using computational formulas, the cache
tuning time is significantly reduced. Analytical models require detailed statistical information and/or information on critical application events, which can be collected using
profilers, such as Cprof [Lebeck and Wood 1994], Ammons et al. [1997], ProfileMe [Dean
et al. 1997], and DCPI [Anderson et al. 1997].
Analytical models oftentimes use complex algorithms that can be difficult to solve
[Chatterjee et al. 2001; Ghosh et al. 1999]. Some analytical models reduce the complexity by introducing empirical parameters inferred from regression [Ding and Zhong
2003], and some analytical models can only produce statistically accurate cache miss
rates by assuming the accessed cache blocks map to all cache sets uniformly [Brehob
and Enbody 1996; Zhong et al. 2003], which violates actual cache behavior. Other
analytical models are derived from the information gathered by a detailed compiler
framework [Chatterjee et al. 2001; Ghosh et al. 1999; Vera et al. 2004], which may
result in inaccurate miss rates due to limited compiler information and unpredictable
hardware. Therefore, analytical modeling-based cache tuning is typically less accurate
than simulation-based cache tuning.
Previous works mostly focus on two distinct analytical modeling categories based on
either application structures or access traces.
Analytical Modeling Based on Application Structures. Since an application’s
spatial and temporal locality characteristics, which are mainly dictated by loops, determine cache behavior, cache misses can be estimated based on application structures
gathered from specially designed loop profilers. However, since loop characteristics
do not sufficiently predict exact cache behavior, which also depends on data layout
and traversal order, estimating an application’s total cache behavior based only on an
application’s loop behavior may be inaccurate.
Ghosh et al. [1999] generated a set of cache miss equations that summarized an
individual loop’s memory access behavior, which provided precise cache miss rate
analysis independent of the data layout. However, directly calculating the cache miss
rate from these equations was NP-complete. Vera et al. [2004] proposed a fast and accurate approach to estimate cache miss equation solutions using sampling techniques
to approximate the absolute cache miss rate for each memory access. Since both of
these models were restricted to perfectly nested loops with no conditional expressions,
these models had limited applicability. Chatterjee et al. [2001] extended the model
to include nonlinear array layouts using a set of Presburger formulas to characterize
and count cache misses. Harper et al. [1999] presented an analytical model that
approximated the cache miss rate for any loop structure. Vivekanandarajah et al.
[2006] determined near optimal cache sizes using a loop profiling-based technique
for loop statistic extraction. Cascaval et al. [2003] estimated the number of cache
misses based on stack algorithms. A stack distance histogram was collected using data
dependence information, which was obtained from the array accesses in the loop nests.
Analytical Modeling Based on Memory Access Traces. Analytical modeling
based on a memory access trace is similar to trace-driven cache simulation in that it uses
functional simulation to produce the memory access trace; however, instead of simulating cache behavior as in trace-driven cache simulation, mathematical equations
statistically or empirically analyze the memory access trace to determine the cache
miss rates.
Most analytical modeling based on a memory access trace leverages the reuse distance between successive accesses to the same memory address. The reuse distance
is the number of unique memory accesses (trace address or block address) between
these two successive accesses [Brehob and Enbody 1996; Ding and Zhong 2003] and
is essentially equal to the number of conflicts determined by stack-based trace-driven
cache simulation for a fully-associative cache [Mattson et al. 1970]. The reuse distance
captures the temporal distance between successive accesses and the spatial locality
when using individual cache block addresses as the unique memory access for determining the reuse distance.
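The definition can be made concrete with a short sketch that builds a reuse distance histogram over block addresses and derives a fully-associative LRU miss rate from it. The function names and the example addresses are assumptions made for illustration, and the quadratic-time computation is chosen for clarity rather than efficiency.

```python
from collections import Counter


def reuse_distance_histogram(addresses, block_size):
    """Histogram of reuse distances measured in unique block addresses.

    The key None counts first-time (cold) accesses.
    """
    last_seen = {}                      # block -> position of its previous access
    blocks = [addr // block_size for addr in addresses]
    histogram = Counter()
    for pos, block in enumerate(blocks):
        if block in last_seen:
            between = set(blocks[last_seen[block] + 1:pos])   # unique blocks in between
            histogram[len(between)] += 1
        else:
            histogram[None] += 1
        last_seen[block] = pos
    return histogram


def predicted_miss_rate(histogram, capacity_blocks):
    """Fully-associative LRU: an access hits only if its reuse distance < capacity."""
    total = sum(histogram.values())
    misses = sum(count for dist, count in histogram.items()
                 if dist is None or dist >= capacity_blocks)
    return misses / total


addresses = [0, 4, 64, 8, 128, 0, 64, 132]
hist_16 = reuse_distance_histogram(addresses, block_size=16)
hist_4 = reuse_distance_histogram(addresses, block_size=4)
```

With 16-byte blocks, the accesses to addresses 0, 4, and 8 fall in the same block and produce short reuse distances, whereas with 4-byte blocks most of them become cold accesses; this is the spatial locality effect mentioned above.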
Brehob and Enbody [1996] leveraged the reuse distance to calculate cache hit probabilities using a uniform distribution assumption for memory accesses. Zhong et al.
[2003] analyzed the variation in reuse distances across different data input sets to
predict fully-associative cache miss rates for any input. Fang et al. [2004] predicted
per-instruction locality and cache miss rates using reuse distance analysis. Ghosh
and Givargis [2004] used an analytical model based on access trace information to efficiently explore cache size and associativity and directly determine a cache configuration
to meet an application’s performance constraints.
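One common way to turn a reuse distance into a hit probability under a uniform-mapping assumption is a binomial model: each intervening unique block is assumed to land in the accessed block's set independently with probability 1/S for S sets, and the access hits if fewer than A such blocks do so in an A-way LRU cache. The sketch below follows that generic formulation; it is not claimed to be the exact model of Brehob and Enbody, and the function names are hypothetical.

```python
from math import comb


def hit_probability(reuse_distance, num_sets, assoc):
    """P(hit) for one access in an LRU set-associative cache, assuming each of the
    `reuse_distance` intervening unique blocks maps to any set with probability
    1/num_sets, independently (the uniform-mapping assumption).
    """
    p = 1.0 / num_sets
    # The access hits if fewer than `assoc` intervening blocks land in its set.
    return sum(
        comb(reuse_distance, k) * p**k * (1.0 - p)**(reuse_distance - k)
        for k in range(min(assoc, reuse_distance + 1))
    )


def expected_miss_rate(histogram, cold_accesses, num_sets, assoc):
    """Expected miss rate from a {reuse distance: count} histogram plus cold misses."""
    total = cold_accesses + sum(histogram.values())
    expected_hits = sum(count * hit_probability(dist, num_sets, assoc)
                        for dist, count in histogram.items())
    return 1.0 - expected_hits / total
```

For a single set the model collapses to the fully-associative rule (hit exactly when the reuse distance is below the associativity), which provides a quick sanity check on the formula.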
For CMP architectures, Chandra et al. [2005] proposed a model using access traces
of isolated threads to predict interthread contention for a shared cache. Reuse distance
profiles were analyzed to predict the extra cache misses for each thread due to cache
sharing, but the model did not consider the interaction between CPI variations and
cache contention. Chen and Aamodt’s [2009] model predicted fine-grained cache
contention, but only considered the CPI variations caused by cache contention, and
there was only one CPI correction iteration. The authors’ model used the CPI acquired
in the absence of contention to predict the shared cache misses, and then the cache
misses were fed back to correct the CPI and provide a more accurate estimate. Eklov
et al. [2011] proposed a simpler model that calculated the CPI considering the cache
misses caused by cache contention. The authors calculated the CPI as a function of
the cache miss rates and then solved for the cache miss rates using this function.
Similarly, Xu et al. [2010] predicted the cache miss rates for a CMP’s shared cache
using the per-thread reuse distance. Instead of using simulation, the reuse distance
was acquired using hardware performance counters by running the application on
a real processor. Xiang et al. [2011] developed a low-cost profiling technique for
full-execution analysis with a guaranteed precision. The shared cache miss rate was
similarly modeled. All of these analytical models for shared cache miss rates assumed
the applications running on different threads/cores were independent with no data
sharing, and only shared cache contention among concurrent threads was considered.
Ding and Chilimbi’s [2009] locality model for multithreaded applications considered
both thread interleaving and data sharing. Shi et al. [2009] analyzed a CMP’s private
and shared caches, and data replication in the private caches was modeled. To simulate
real-time interactions among multiple cores, the authors developed a single-pass
trace-driven stack simulation, where a shared stack and per-core private stacks were
maintained to collect reuse distances to calculate the CMP private and shared cache
misses with various degrees of data replication. In addition to the analytical model
built on statistical characteristics, empirical models were developed for multicore
cache miss rate calculation, such as that of Gluhovsky and O’Krafka [2005] for Sun
microarchitectures. All of these analytical models are limited to in-order processors.
For out-of-order processors, modeling thread interleaving becomes highly complex.
2.3. Summary of Offline Static Cache Tuning
In offline static cache tuning, designers determine the optimal cache configuration to
meet an application’s specific design metrics before system execution. Table I summarizes the offline static cache tuning techniques and associated attributes reviewed
in this section. The first major column lists the cache tuning techniques, where some
major techniques are subdivided into different variations. The second major column
summarizes the main features of each technique. The third major column compares the
techniques based on simulation/evaluation accuracy and speed aspects. The accuracy
in this table measures how closely the cache performance evaluated by the simulator or
the analytical model is to real hardware execution. In simulation time comparison, we
include some quantified simulation speedups collected directly from previous works.
These numbers provide only general comparisons, since the simulation speedups
may change considerably for different simulators, benchmarks, system settings, and
design spaces. In general, when comparing simulators, more accurate and higher
coverage simulators are more complicated to design and execute and thus have slower
simulation time.

Table I. Summary of Offline Static Cache Tuning Techniques

Simulation-based cache tuning

  Functional simulation and timing simulation
    Features:
      Functional simulation: Only provides functionally correct [...]
      Timing simulation: Includes simulation for timing related [...]
    Comparison:
      1. Timing simulation is more accurate.
      2. Functional simulation is faster than timing simulation
         (25-338X speedup [Lee et al. 2010]).

  Trace/event-driven simulation and execution-driven simulation
    Features:
      Trace/event-driven simulation: Uses a functional simulator to generate
        traces. Trace is input into a timing simulator. Many simulation
        iterations with [...]
      Execution-driven simulation: Integrates functional and timing [...]
        Application binary executes on a [...]
    Comparison:
      1. For trace/event-driven simulation, trace/event collection is
         challenging.
      2. In trace/event-driven simulation, capturing the dynamic behavior of
         applications and speculative execution paths is challenging and input
         sets are fixed, thus trace/event-driven simulation is generally less
         accurate than execution-driven simulation.
      3. Trace/event-driven simulation is faster than execution-driven
         simulation (125-171X speedup [Lee et al. 2010]).

  Microarchitecture simulation and full-system simulation
    Features:
      Microarchitecture simulation: Widely used. Limited to user-level code
        simulation. Emulates OS calls.
      Full-system simulation: Simulates both user- and OS-level code.
        Includes system device [...] Acts as a virtual [...]
    Comparison:
      1. Full-system simulation is more accurate than microarchitecture
         simulation since the OS calls, OS scheduling, and I/O events can be
         simulated precisely.
      2. Microarchitecture simulation is faster than full-system simulation
         (182-338X speedup [Lee et al. 2010]).

  Specialized trace-driven cache simulation (sequential and single-pass)
    Features:
      Simulates memory access trace for each [...]
      Simulating only the cache module is faster than simulating the entire
        system (100-1000X speedup [Jaleel et al. 2008a]).
      Trace is fixed. Cannot simulate dynamic changes in a real system. Not
        sufficiently accurate.
    Comparison:
      1. Single-pass simulation simulates multiple configurations
         simultaneously, which is faster than sequentially simulating each
         configuration (8-15X speedup [Zang and Gordon-Ross 2011]).
      2. Single-pass trace-driven cache simulation is confined to a limited
         cache organization ([...] single-processor caches) and access
         order-based replacement policies.

Analytical modeling-based cache tuning

  Based on application structures and based on memory access trace
    Features:
      Calculates cache hit/miss rates using [...]
    Comparison:
      1. Fast.
      2. Complex algorithm.
      3. Cache miss rates may be inaccurate.
      4. Imposes some assumptions on application code and/or hardware
         characteristics.
We also discussed cache tuning acceleration techniques, which included simulation
and design space exploration acceleration. These two acceleration techniques are orthogonal and can be employed concurrently. To further increase the speedup, all of the
reviewed simulation acceleration techniques (including sampled simulation, statistical
simulation, and parallel simulation) can be applied together. However, the acceleration
is obtained at the cost of sacrificing accuracy, resulting in potentially suboptimal cache configurations.
In analytical modeling-based cache tuning, the model’s inputs generally consist of
the events of interest, detailed statistical information gathered during execution, or the
code and data flow structures collected using profilers and/or compilers. The profiler
can be built as part of the simulator or as real hardware to collect detailed information
during the application’s simulation or execution. However, statically gathered profile
information may be of little use if the profiled behavior changes with the input sets,
data layouts, or microarchitecture. Compiler information captures only the code
layout characteristics. Although it is difficult for a compiler to thoroughly analyze data, compilers provide a basic view of static control and data flow and can easily
modify the code, which enables a basic analysis of instruction cache performance
and phase changes.
Several offline static cache tuning characteristics and techniques are amenable to
online dynamic cache tuning, such as efficient heuristic techniques to prune the design
space, hardware implementation of the single-pass trace-driven cache simulation
techniques, and analytical modeling based on memory access traces. In the next
section, we elaborate on online cache tuning techniques, including how these offline
static cache tuning characteristics and techniques can be leveraged during online
dynamic cache tuning.
Online dynamic cache tuning alleviates many of the drawbacks of offline static cache
tuning by adjusting cache parameters during runtime. Since online cache tuning
executes entirely during runtime, designers are not burdened with setting up realistic
simulation environments or modeling accurate input stimuli, and thus all design-time
efforts are eliminated. Additionally, since online cache tuning can dynamically react to
changing execution, the cache configuration can also change to accommodate new execution behavior or application updates. Online dynamic cache tuning is appropriate for
multicore architectures, since offline prediction of coscheduled applications is difficult
due to variations in the applications’ execution times with respect to different cache
configurations. Additionally, an exponential increase in the number of combinations
of coscheduled applications makes offline cache tuning for all possible combinations
infeasible.
However, online dynamic cache tuning requires some hardware overhead to monitor cache behavior and control cache parameter changes, which introduces additional
power, energy, and/or performance overheads during cache tuning and can diminish
the total savings. Thus, online cache tuning time is more critical than offline cache
tuning time. Online cache tuning must determine the optimal cache configuration fast
enough to react to execution changes, and the configurable cache architecture must
have a lightweight mechanism for dynamically changing the cache parameters that
incurs minimal overheads without affecting system performance.
Whereas a soft-core configurable cache (Section 2.1) can be custom-synthesized for a
particular cache configuration, that configuration remains fixed for the entire lifetime
of the system. Alternatively, a hard-core configurable cache allows the cache parameters and organization to be varied during runtime and includes an integrated online
cache tuning technique implemented as specialized hardware or a custom coprocessor
to adjust the parameters and organization. Since hard-core configurable cache architectures are based on circuit-level techniques that allow for selectively activating portions
of the cache elements (such as cache blocks, ways, or sets) to reduce leakage power,
we first review circuit-level techniques leveraged by configurable caches in Section 3.1.
Section 3.2 reviews the configurable cache architectures in conjunction with the architectures’ corresponding cache tuning techniques for single-core architectures. Section 3.3 reviews some prevalent cache tuning techniques for multicore architectures,
which share fundamental ideas with single-core cache tuning; however, special considerations must address multicore cache tuning challenges, such as cache interference
between multiple threads/cores and complicated shared and private last-level cache
organizations.
In online dynamic cache tuning, the cache tuning process occurs either at the beginning of application execution for application-based cache tuning or at specified intervals
during application execution for phase-based cache tuning. In phase-based cache tuning, both how to tune the cache (how to explore the design space) and when to
tune the cache are important. Phase-change detection techniques are widely studied and
readily incorporated transparently into configurable cache architectures. Section 3.4
summarizes phase-change detection techniques with respect to cache tuning for both
single- and multicore architectures.
3.1. Circuit-Level Techniques for Reducing Leakage Power
To reduce leakage power consumption, major semiconductor companies employ high-k
[Bohr et al. 2007] dielectric materials to replace the common SiO2 oxide material in
the process technologies. Since the process-level solution can be orthogonally combined
with architecture-level techniques and has scarcely been considered in architecture-level cache tuning design, this survey focuses only on circuit-level techniques that are commonly used in architectural-level cache tuning hardware,
rather than the process-level solutions for semiconductors.
Since the number of active transistors directly dictates a circuit’s leakage power, deactivating unused transistors can reduce the leakage power. For example, infrequently
used or unused cache blocks/sets can be turned off or placed into a low leakage mode using techniques such as a gated-Vdd cache (Section 3.1.1), a drowsy cache (Section 3.1.2),
or threshold voltage (VT) manipulation (Section 3.1.3).
3.1.1. Gated-Vdd Cache. Powell et al. [2000] developed the gated-Vdd cache, wherein
memory cells (e.g., SRAM) could be disconnected from the power and/or ground rails
using a gated-Vdd transistor. Figure 3 depicts a memory cell with an NMOS gated-Vdd. During a cache access, the address decode logic determines the wordlines to be
activated, which causes the memory cell to read the values out to the precharged
bitlines. As shown in Figure 3, the two inverters have a Vdd to Gnd leakage path. If
the gated-Vdd transistor is turned on, the memory cell is in an active mode for normal
operations. Otherwise, the memory cell is in a sleep mode and quickly loses the stored
value. Recharging the power supply sets the memory cell to a random logic state. The
size of the gated-Vdd transistor should be designed carefully. The gated-Vdd transistor
should be large enough so that the transistor does not impact the cell read/write speed
while in the active mode. However, too large a gated-Vdd transistor will diminish the
sleep mode functionality and reduce the energy savings. Since the gated-Vdd transistor
can be shared among multiple memory cells, the area overhead is very small.
Fig. 3. A memory cell with an NMOS gated-Vdd [Powell et al. 2000].
Although the sleep mode efficiently reduces the leakage power, a cache block in sleep
mode does not preserve the stored data and an access to that data results in a cache
miss and an associated wakeup time. Since the wakeup time and cache miss can impose
significant performance overheads, cache tuning techniques that leverage gated-Vdd
should conservatively place cache blocks into sleep mode. Gated-Vdd was implemented
in several configurable cache architectures for dynamic tuning, such as the DRI (Deep-Submicron Instruction) cache [Powell et al. 2001] and cache decay [Kaxiras et al. 2001;
Zhou et al. 2003] (Section 3.2).
3.1.2. Drowsy Cache. In a cache with gated-Vdd , the supply voltage for a memory cell
is either fully on or fully gated (off). The drowsy cache [Flautner et al. 2002] provides
a compromise between on and off by reducing the supply voltage as low as possible
without data loss. The memory cells that are not actively accessed can be voltage-scaled down to a drowsy mode to reduce the leakage power. However, the scaled voltage
makes the circuit more sensitive to process variation and more susceptible to single-event
upset noise. These problems can be relieved by using an appropriate process technique
in semiconductor production and choosing a conservative Vdd value.
Rather than completely cutting off the circuit connection to power and/or ground
rails in a gated-Vdd cache, the nonzero supply voltage in the drowsy mode can preserve
the memory cell’s state. With respect to the data-preserving capability, the drowsy
cache is referred to as a state-preserving cache and the gated-Vdd cache is referred to
as a state-destroying cache. In a drowsy cache, accessing a cache block in active mode
does not impose any performance loss; however, accessing a cache block in drowsy mode
requires first waking up the block, otherwise the data may be read out incorrectly.
The transition from drowsy mode (low voltage) to full Vdd requires a one- or two-cycle
wakeup time, which is much less than a cache miss. Therefore, cache blocks can be
put into drowsy mode more aggressively than sleep mode. For the cache blocks with
data that will be accessed again but not for a sufficiently long period of time, drowsy
mode is superior to sleep mode because no cache miss is incurred. However, since a
drowsy cache does not fully turn off the memory cells, a drowsy cache cannot reduce
the leakage power by as much as a gated-Vdd cache.
Figure 4 depicts the schematic of a memory cell in a drowsy cache. The circuit uses
two PMOS gate switches to supply the normal supply voltage Vdd for the active mode
and the low supply voltage VddLow for the drowsy mode, respectively. The memory cell is
connected to a voltage-scaling controller, which switches the supply voltage between
the active-mode and drowsy-mode levels based on the state of the drowsy bit. Various
dedicated tuning techniques have been developed to set/clear the drowsy bit based on
the cache access pattern (Sections 3.2 and 3.4).
Fig. 4. Schematic of a memory cell in a drowsy cache [Flautner et al. 2002].
3.1.3. Threshold Voltage (VT) Manipulation. Manipulating the threshold voltage VT of transistors can reduce leakage power. However, since the device speed depends on the
difference between Vdd and VT, manipulating VT results in a trade-off between
speed and power. Decreasing VT increases speed at the cost of higher leakage power, whereas
increasing VT can significantly reduce the power consumption, but at the expense of
slower speed.
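The trade-off can be made concrete with two standard first-order device approximations. These are textbook models rather than formulas taken from the works surveyed here, and the symbols I_0, n, v_th, C, and \alpha are fitting parameters introduced only for illustration. Subthreshold leakage falls exponentially as VT rises, whereas the alpha-power-law gate delay grows as VT approaches Vdd:

```latex
I_{\text{leak}} \approx I_0 \exp\!\left(-\frac{V_T}{n\,v_{th}}\right),
\qquad
t_d \approx \frac{C\,V_{dd}}{\left(V_{dd}-V_T\right)^{\alpha}},
\quad 1 \le \alpha \le 2
```

Here v_th denotes the thermal voltage. Under these assumptions, raising VT by a fixed step reduces leakage by a constant factor but lengthens the delay increasingly quickly as the difference between Vdd and VT shrinks.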
Dual-VT [Dropsho et al. 2002b] statically employed low-VT (high speed, high leakage) transistors on the critical paths and high-VT (low speed, low leakage) transistors
on the noncritical paths at design time. The cache was implemented using high-VT
transistors in the memory cells and low-VT transistors in all other areas within the
SRAM. Dynamically raising the VT was accomplished by modulating the back-gate bias
voltage for the multithreshold-CMOS (MTCMOS) [Hanson et al. 2003]. The memory
cells with high VT were set to drowsy mode while preserving the cells’ value. Another
technique applied dynamic forward body-biasing to low-leakage SRAM caches [Kim et al.
2005] wherein only the active portion of the cache was modulated to low-VT for fast
reads and writes. The MTCMOS technique is similar to the drowsy cache with respect
to modulating the power supply voltage, and a cache block in drowsy mode must be
woken up prior to accessing the cache block’s data. As compared to the drowsy cache,
the MTCMOS cache provides a higher reduction in leakage power; however, the trade-off is more complicated circuit control and a higher overhead due to the longer drowsy
mode wakeup time.
3.1.4. Summary. Taking the CMOS fabrication cost and the compromise between leakage power savings and the device speed into consideration, the gated-Vdd and drowsy
caches are the most suitable for state-destroying and state-preserving techniques, respectively, as compared to threshold voltage manipulation [Chatterjee et al. 2003].
Therefore, the gated-Vdd and drowsy caches are given more attention in architectural-level configurable cache design than threshold voltage manipulation.
Based on different features provided by the gated-Vdd cache and the drowsy cache,
Meng et al. [2005] explored the limits of leakage power reduction for the two techniques
using a parameterized model. Li et al. [2004] compared the effectiveness of the two
techniques for different level-two cache latencies and different gate oxide thickness
values and showed that the gated-Vdd cache could be superior to a drowsy cache when
the level-two cache access time was sufficiently fast. Li et al. [2002] concluded that
using a gated-Vdd cache for the level-one cache and a drowsy cache for the level-two
cache yielded substantial leakage power savings without significantly impacting the
performance.
Fig. 5. Cache tuning components.
A hybrid of the drowsy and gated-Vdd techniques can be used in a single cache.
The cache blocks with a short period of inactivity are put into drowsy mode. If those
cache blocks continually stay inactive for a long period, the blocks are powered off
to sleep mode. In the hybrid cache, the problems for the state-destroying cache, such
as the large overhead in accessing the deactivated cache blocks and the extra energy
consumed during the long waiting period of detecting the unused blocks, are alleviated.
Zhang et al. [2002] used compiler analysis for loop instructions and data flow to guide
the power-mode switch between active, drowsy, and sleep modes. Power-mode control
instructions for changing the supply voltages were inserted into the code in order to
activate/deactivate cache blocks during runtime.
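The hybrid policy described above can be sketched as a small per-block state machine. This is a minimal illustration, not the mechanism of any cited work: the class name `HybridBlock`, the thresholds `DROWSY_AFTER` and `SLEEP_AFTER`, and the outcome labels are all hypothetical.

```python
from enum import Enum


class Mode(Enum):
    ACTIVE = "active"
    DROWSY = "drowsy"   # state-preserving, reduced supply voltage
    SLEEP = "sleep"     # state-destroying, gated-Vdd


# Hypothetical inactivity thresholds, counted in monitoring intervals.
DROWSY_AFTER = 4
SLEEP_AFTER = 16


class HybridBlock:
    """One cache block governed by a drowsy-then-sleep policy."""

    def __init__(self):
        self.mode = Mode.ACTIVE
        self.idle = 0

    def tick(self):
        """Advance one interval in which the block was not accessed."""
        self.idle += 1
        if self.idle >= SLEEP_AFTER:
            self.mode = Mode.SLEEP
        elif self.idle >= DROWSY_AFTER:
            self.mode = Mode.DROWSY

    def access(self):
        """Access the block and report the cost class of the access."""
        if self.mode is Mode.SLEEP:
            outcome = "miss"          # data was lost and must be refetched
        elif self.mode is Mode.DROWSY:
            outcome = "wakeup_hit"    # data preserved, short wakeup delay
        else:
            outcome = "hit"
        self.mode = Mode.ACTIVE
        self.idle = 0
        return outcome
```

A block that is idle for only a few intervals pays the short wakeup delay, whereas a block that stays idle long enough is powered off and incurs a miss on its next access.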
3.2. Cache Tuning for Single-Core Architectures
Figure 5 depicts the three essential cache tuning components and the interactions
between these components. The configurable cache architecture contains a parameter
change controller that orchestrates the cache parameter changing process. Circuit-level
techniques that selectively activate portions of the cache elements on different levels
of granularity provide the basis for adjusting the total cache size, block size, and associativity. The parameter change controller leverages these circuit-level techniques to
dynamically adjust the cache parameters. System metric tracking can directly monitor
the system’s power/energy consumption, collect cache misses within a time interval, or
record cache access patterns using specially-designed hardware counters or software
profiling. The cache tuning decision evaluates the system metrics to determine the best
(optimal may not be known) cache configuration using one of two techniques: (1) evaluating a series of configurations and selecting the best configuration; or (2) increasing
or decreasing (adjusting) the cache parameters based on the system metric tracking
results. The cache tuning decision can be implemented in either hardware or software,
where the optimization target criteria (e.g., lowest energy consumption) are calculated
from the collected system metrics for different cache configurations and are compared
to determine a better cache configuration than the previous configuration. The cache
tuning decision (the best cache parameters or a parameter adjustment) is input into the
configurable cache architecture, where the parameter change controller changes the
cache parameters accordingly.
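The first decision technique, evaluating a series of configurations and selecting the best, can be sketched as follows. The energy model, the miss counts, and the function names are hypothetical placeholders chosen only to make the loop runnable; a real tuner would obtain the metrics from hardware counters or profiling.

```python
def tune_by_evaluation(configs, run_interval, energy_of):
    """Try every candidate configuration for one interval and keep the best."""
    best_cfg, best_energy = None, None
    for cfg in configs:
        metrics = run_interval(cfg)        # system metric tracking
        energy = energy_of(cfg, metrics)   # optimization target criterion
        if best_energy is None or energy < best_energy:
            best_cfg, best_energy = cfg, energy
    return best_cfg, best_energy


# Hypothetical measurements: misses per 1000 accesses for each cache size in bytes.
OBSERVED_MISSES = {2048: 200, 4096: 60, 8192: 50}


def run_interval(size_bytes):
    """Stand-in for executing the application for one interval."""
    return {"accesses": 1000, "misses": OBSERVED_MISSES[size_bytes]}


def energy_of(size_bytes, metrics):
    """Toy model in arbitrary integer units: per-access energy grows with
    cache size, and every miss costs a fixed penalty."""
    per_access = size_bytes // 1024
    return metrics["accesses"] * per_access + metrics["misses"] * 20
```

With these assumed numbers, `tune_by_evaluation([2048, 4096, 8192], run_interval, energy_of)` selects the 4096-byte configuration, because the largest cache saves few additional misses while costing more per access.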
In this section, we discuss cache tuning techniques, categorized based on the technique’s cache parameter adjustment capabilities, in conjunction with the configurable
cache architecture leveraged by the cache tuning techniques.
3.2.1. Way/Bank Management. Cache way/bank management provides a mechanism
to adjust the total cache size and associativity using the following techniques: shutting down cache ways/banks (configurable size); concatenating cache ways
(configurable associativity); configuring individual cache ways to store instructions
only, data only, both instructions and data, or shutdown (way designation); and/or
partitioning a large cache structure into multilevel caches.
Motorola’s M*CORE [Malik et al. 2000] processor included a four-way configurable
level-two cache with way designation. Albonesi [1999] proposed a configurable
cache that partitioned the data and tag arrays into one or more subarrays for each cache
way. First, the application was run with all ways enabled and performance-sampling
techniques were used to collect cache access information. Based on this information,
the performance degradation and energy savings afforded by disabling ways were estimated. Balasubramonian et al. [2000] improved that configurable cache by partitioning
the remaining ways as level-two cache. Therefore, the cache could be virtually configured as a single- or two-level cache hierarchy. For each cache access, the level-one
partition was checked first. On a cache miss, the level-two partition was checked. During cache tuning, the level-one cache was initialized to the smallest size. After a certain
interval, if the accumulated miss rate was larger than a predetermined threshold, the
level-one cache was increased to the next larger size until the level-one cache size
reached the maximum value or the miss rate was sufficiently small. Amongst the examined cache configurations, the configuration with the lowest CPI was chosen. Based
on the way selective cache [Albonesi 1999], Dropsho et al. [2002a] partitioned the cache
ways into primary ways and secondary ways, where cache accesses checked only the
primary ways first, then checked the secondary ways on a primary way miss. In this
architecture, the LRU information was recorded for performance tracking, from which
the hit ratios for any partitioning of the cache were constructed to calculate the energy
consumption of each possible cache configuration. Kim et al. [2002] developed control
schemes for drowsy instruction caches in which only the bank that was predicted to be
accessed was kept active, while the other banks were put into drowsy mode. Ishihara
and Fallah [2005] proposed a non-uniform cache architecture where different cache
sets had different numbers of ways. The authors’ proposed cache tuning algorithm
greedily reduced the energy by iteratively searching the number of ways that offered
the largest energy reduction for each cache set. Ranganathan et al. [2000] partitioned
cache ways and used different partitions for different processor activities, such as
hardware lookup tables, storage area for prefetched information, or other modules for
hardware optimization and software controlled usage.
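The idea of reconstructing hit ratios for any primary/secondary partitioning from recorded LRU information can be sketched as below. This is an illustrative simplification, assuming that the most recently used blocks always reside in the primary ways; the energy constants `e_way` and `e_miss` and the function names are hypothetical.

```python
def partition_hits(mru_hits, primary_ways):
    """Split recorded hits into primary-way and secondary-way hits.

    mru_hits[i] is the number of hits observed at LRU stack position i
    (0 = most recently used) with all ways enabled.
    """
    primary = sum(mru_hits[:primary_ways])
    secondary = sum(mru_hits[primary_ways:])
    return primary, secondary


def best_primary_ways(mru_hits, misses, e_way=1, e_miss=50):
    """Estimate the energy of each partitioning and return the cheapest one."""
    ways = len(mru_hits)
    accesses = sum(mru_hits) + misses
    best = None
    for p in range(1, ways + 1):
        _, secondary = partition_hits(mru_hits, p)
        # Every access probes the p primary ways; accesses that miss there
        # (secondary hits and true misses) also probe the remaining ways.
        energy = accesses * p * e_way
        energy += (secondary + misses) * (ways - p) * e_way
        energy += misses * e_miss
        if best is None or energy < best[1]:
            best = (p, energy)
    return best
```

A single profiling interval with all ways enabled therefore suffices to compare every partitioning, since the per-position counters determine the primary and secondary hit counts for each choice of `p`.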
3.2.2. Block/Set Management. Block/set management provides a finer granularity management technique, compared to way/bank management. Instead of managing the
cache blocks/sets globally, the cache blocks/sets can be individually managed. However,
as a trade-off, this finer granularity requires more hardware overhead to monitor and
manage each individual cache block/set. A common block/set management technique
shuts down individual cache sets/blocks to vary the cache size and/or fetches or replaces
a variable number of cache blocks simultaneously to adjust the block size.
Powell et al. [2001] developed the DRI (Deep-Submicron Instruction) cache, which
downsized/upsized cache sets by powers of two. All of the tag bits necessary for the
smallest cache size were maintained, and cache lookup used an index mask to select
the appropriate number of index bits for the current number of sets. The number of
cache misses was accumulated with a counter over a fixed interval to determine the
cache upsize/downsize based on the relationship between the measured miss rate and
a preset value. Kaxiras et al. [2001] varied the cache size on a per-cache block basis by
shutting down cache blocks that contained data that was not likely to be reused. A decay
counter associated with each cache block was inspected at an adaptively varied decay
interval to determine when a cache block should be shut down. Adaptively adjusting
the decay interval was critical to maximizing the performance and energy savings. An
aggressively small decay interval could cause extra cache misses due to shutting down
cache blocks that were still “alive,” thereby destroying the cache resizing advantages.
If the decay interval was too large, the unused cache blocks remained active for a
longer period of time while waiting for the decay interval to expire, thus expending
extra energy. Flautner et al. [2002] provided two schemes for managing the cache
blocks. One scheme periodically put all cache blocks into drowsy mode and drowsy
blocks were woken up when the block was accessed. This scheme required only a single
global counter; however, the performance overhead could be high due to the extra cycles
spent on the drowsy blocks’ wakeup time. The second scheme put the cache blocks into
drowsy mode if the block had not been accessed during a decay interval. This scheme
had more hardware overhead due to monitoring per-block accesses. Hu et al. [2003]
improved the drowsy cache control proposed by Flautner et al. [2002] by exploiting the
code behavior to identify subroutine instructions. These instructions were tagged as
hotspots and were kept active and excluded from drowsy control. Zhou et al. [2003] only
deactivated the data portion of cache blocks and kept the tag active. The decay interval
was dynamically adjusted by monitoring the miss rate caused by the deactivated blocks
acquired from the active tags. If the miss rate caused by the deactivated blocks was
too small, the decay interval was decreased and the total percentage of deactivated
blocks was increased. Otherwise, the decay interval was increased to deactivate cache
blocks more conservatively. Ramaswamy and Yalamanchili [2007] varied the cache size
by folding/merging cache sets with a long period of little or no activity and splitting
sets if the number of accesses to the merged sets exceeded a preset threshold within a
time interval. Since folding/merging cache sets caused the set indexes to differ from the
original set indexes before folding/merging, a hardware-based lookup table maintained
the new set indexes, which incurred one extra cycle. Veidenbaum et al. [1999] leveraged
a base cache with a small physical block size and included the ability to dynamically
fetch and replace a variable number of blocks simultaneously according to the exhibited
spatial locality. Chen et al. [2004] also presented a configurable cache with variable
cache block size. The block size was controlled by a spatial pattern predictor, which
used access history information to predict future cache block usage and predicted the
number of cache blocks to be fetched. In addition to fetching multiple
cache blocks simultaneously, the configurable cache enabled the fetching of subblocks
that were predicted as to-be-referenced and deactivated subblocks that were predicted
as unreferenced to save leakage energy.
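As an illustration of set resizing by powers of two with an index mask, the following sketch models only the indexing and the interval-based resize decision described above for the DRI cache. The class name, default sizes, and the miss bound are hypothetical, and tag handling is omitted.

```python
class ResizableSets:
    """Number of active sets varies by powers of two between two bounds."""

    def __init__(self, min_sets=64, max_sets=1024, block_bytes=32, miss_bound=100):
        self.min_sets = min_sets
        self.max_sets = max_sets
        self.block_bytes = block_bytes
        self.miss_bound = miss_bound   # preset value compared against the miss counter
        self.sets = max_sets           # start with all sets enabled

    def index(self, addr):
        """Select the set using only as many index bits as the current size needs."""
        mask = self.sets - 1
        return (addr // self.block_bytes) & mask

    def end_interval(self, miss_count):
        """Upsize after a high-miss interval, downsize after a low-miss interval."""
        if miss_count > self.miss_bound and self.sets < self.max_sets:
            self.sets *= 2
        elif miss_count < self.miss_bound and self.sets > self.min_sets:
            self.sets //= 2
        return self.sets
```

Because the mask simply drops high-order index bits when the cache shrinks, an address maps to a different set only when one of the dropped bits was set.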
3.2.3. Cache Size, Block Size, and Associativity Management. Since adjusting any of the
cache parameters can impact energy consumption and increased cache configurability (more possible parameter values) results in increased energy savings, Zhang et al.
[2004] designed a highly configurable cache with configurable size (using way
shutdown), associativity (using way concatenation), and block size (by fetching additional
blocks through block concatenation). The authors designed a configuration circuit that
used partial tag bits as inputs to implement way concatenation in logic. Since the configuration circuit was executed concurrently with the index decoding, the configuration
circuit’s delay did not increase the cache lookup time or the critical path delay. Cache
tuning was complicated since variations in the cache performance metric did not intuitively indicate which cache parameter should be downsized/upsized. Additionally,
similarly to offline cache tuning (Section, exhaustive exploration was not feasible for online cache tuning due to lengthy cache tuning time. Runtime design space
exploration typically executes the application for a period of time in each potential
configuration until the best configuration is determined. However, during this exploration, many inferior configurations are executed and introduce enormous energy and
performance overheads.
Table II. Summary of Online Dynamic Cache Tuning Techniques
Technique | Management techniques | Varied cache parameters | Advantages | Drawbacks
Way/bank management | 1. Shut down some ways. 2. Concatenate ways. 3. Configure individual ways for different usage patterns. 4. Partition ways into level one and level two caches. | Size, associativity | 1. Small changes to a conventional set-associative cache. 2. Small hardware overhead. | 1. Limited cache design space.
Block/set management | 1. Shut down some blocks/sets. 2. Fetch/replace a variable number of blocks. | Size, block size | 1. Provides finer granularity. 2. More energy savings due to finer granularity. | 1. Finer granularity 2. More hardware overhead for finer granularity tuning.
Size, block size, and associativity management | 1. Way shutdown. 2. Way concatenation. 3. Block concatenation. | Size, block size, associativity | 1. Increased configurability. 2. Increased energy savings. | 1. Large design space. 2. Large tuning time and tuning energy even with heuristics.
Heuristic search techniques reduce the number of configurations explored, which
significantly reduces the online cache tuning time and energy/performance overheads.
Zhang et al. [2004] analyzed the impact of each cache parameter on the cache miss
rate and energy consumption. Ordering the parameters from the largest to the smallest
impact on energy, Zhang et al. developed a heuristic search for a single-level cache that
determined the best cache size, then block size, and then associativity. The impact-ordered
heuristic search can prune tens to thousands of configurations from the design space.
The number of pruned configurations depends on the number of configurable parameter
values. Based on Zhang et al.’s impact-ordered single-level cache tuning heuristic, Gordon-Ross et al. [2004] developed a two-level cache tuner called TCaT. TCaT leveraged
the impact-ordered heuristic to interlace the exploration of separate instruction and
data level-one and level-two caches. TCaT explored only 6.5% of the design space and
determined cache configurations that consumed only 1% more energy than the optimal
cache configuration. Gordon-Ross et al. [2009] extended TCaT for a unified second-level
cache, which achieved an average energy savings of 61% while exploring only 0.2% of
the design space on average.
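An impact-ordered search of this kind can be sketched as follows. The parameter order (size, then block size, then associativity) follows the description above; the stopping rule (stop at the first value that does not lower energy), the dictionary representation, and the toy energy function are assumptions made for illustration and are not taken from the cited works.

```python
def tune_parameter(cfg, name, values, energy):
    """Step through one parameter's values in increasing order and stop at
    the first value that does not lower the energy."""
    best = dict(cfg, **{name: values[0]})
    best_energy = energy(best)
    evaluations = 1
    for value in values[1:]:
        candidate = dict(best, **{name: value})
        candidate_energy = energy(candidate)
        evaluations += 1
        if candidate_energy >= best_energy:
            break
        best, best_energy = candidate, candidate_energy
    return best, evaluations


def impact_ordered_search(space, energy):
    """Tune cache size first, then block size, then associativity."""
    cfg = {name: values[0] for name, values in space.items()}
    total = 0
    for name in ("size", "block", "assoc"):
        cfg, evaluations = tune_parameter(cfg, name, space[name], energy)
        total += evaluations
    return cfg, total


def toy_energy(cfg):
    """Hypothetical energy surface with its minimum at 4 KB, 32 B, 2-way."""
    return (abs(cfg["size"] - 4096) // 1024
            + abs(cfg["block"] - 32) // 16
            + abs(cfg["assoc"] - 2))


SPACE = {"size": [2048, 4096, 8192], "block": [16, 32, 64], "assoc": [1, 2, 4]}
```

On this toy surface the search performs 9 energy evaluations instead of the 27 needed for exhaustive exploration, which illustrates why tuning one parameter at a time prunes the design space.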
3.2.4. Summary. Configurable cache architectures leverage circuit-level techniques to
vary cache parameters. Based on the cache parameter adjustment capability, online
cache tuning techniques can be classified as way/bank management, block/set management, and cache size, block size, and associativity management. Table II summarizes
these classifications: the first column lists the techniques, the second column lists the
techniques to manage the configurable cache architectures, the third column indicates
the corresponding varied cache parameters, and the fourth and fifth columns indicate
the corresponding advantages and drawbacks, respectively.
Although the individual block/set management’s finer granularity potentially provides more energy savings as compared to the global block/set or way/bank managements’ coarser granularity, coarse granularity has several advantages over fine granularity. Coarse granularity management requires only small changes to a conventional
set-associative cache and introduces small hardware overhead. In finer granularity
management, such as per-cache block management, a cache block access monitor (e.g.,
counter) and additional controller are required for each cache block. Comparatively,
in coarse granularity management, only one global controller and monitor is required
for all ways or for the entire cache. In addition, an address’s index scales well with
changes in cache size, associativity, and block size in coarse granularity management.
However, coarse granularity management limits the configurable cache design space.
For instance, way/bank management is limited by the cache’s base associativity. Caches
with a high base associativity provide more configurability than caches with a low base
associativity, but high base associativity caches have longer cache access time and more
tag storage space as a trade-off.
For dynamic data cache tuning, modified cache blocks may complicate cache resizing. When a dirty cache block in a write-back cache is deactivated, the data consistency/coherency must be maintained in case the block is reactivated and the modified
cache blocks are accessed again. The most straightforward technique is to flush all
of the cache blocks before the blocks are disabled. However, this cache flush imposes
large performance and energy overheads, especially for a multilevel cache hierarchy
or a multicore cache, and diminishes the energy savings. To minimize cache flushing, Zhang et al. [2004] tuned the cache size and associativity by increasing, rather than
decreasing, the values. For the state-preserving cache, Albonesi [1999] provided
an alternative technique instead of cache flushing. The cache coherence behaved as
if the entire cache was enabled. In addition, the authors designed hardware that allowed the cache controller to read blocks from deactivated ways into activated ways
using a fill data path and then invalidated the blocks in the deactivated ways. For
the state-destroying cache, deactivating a dirty block requires a write-back. If multiple
blocks are written back simultaneously as a result of deactivating a large portion of
the cache, bus traffic congestion may occur due to the limited main memory bandwidth. Redistribute write-back techniques [Lee et al. 2000] that write dirty cache
blocks to main memory prior to the block’s eviction/deactivation can be used to avoid
this congestion.
3.3. Cache Tuning for Multicore Architectures
Some single-core cache tuning techniques can be leveraged in multicore cache tuning,
such as unused way/set shutdown, way concatenation, and merging/splitting cache
sets. However, when adjusting the cache parameters, the interdependency between
cores must be considered, such as shared data consistency and resource contention.
Additionally, various multicore cache organizations bring new configuration opportunities and challenges, which increase the cache tuning complexity.
Section 3.3.1 discusses efficient design space exploration techniques for heterogeneous cache architectures, where each core’s private cache can be tuned with a different
size, block size, and associativity. These exploration techniques are limited to the first
level of cache and techniques for multilevel caches and shared cache parameter and
organization configurations are still open research topics.
Most research has focused on the design and optimization of a CMP’s last-level cache
(LLC) due to the large effects that the LLC has on system performance and energy
consumption. The LLC is usually the second- or third-level cache, which is typically very
large and requires long access latencies that may be non-uniform (e.g., non-uniform
cache architecture (NUCA) caches). In static-NUCA (S-NUCA), the mapping of data blocks
to cache banks is statically determined by the block addresses. In dynamic-NUCA
(D-NUCA), data blocks can migrate between banks based on the proximity of the
requesting cores, and thus reduce the hit latency. As a trade-off, locating a desired block
in D-NUCA may require searching multiple banks. Since the D-NUCA cache’s long data
migration time and searching operations increase the cache’s dynamic energy and
degrade the system performance, cache tuning techniques for S-NUCA and D-NUCA
have different challenges. Additionally, since large LLCs occupy a large die area and
consume considerably high leakage power, deactivating unused portions of the LLC is
important for energy savings.
Shared LLCs have more efficient cache capacity utilization than private LLCs,
because shared LLCs have no shared address replication and no coherence overhead
for shared data modification. However, large shared LLCs may have lengthy hit latencies, and multiple cores may have high contention for the shared LLC’s resources even
though the threads do not have shared data. Such contention generates high interference between threads and considerably affects the performance of individual threads
and the entire system. To overcome the drawbacks of both shared and private LLCs, the
cache capacity can be dynamically managed by assigning a variable amount of private
and shared LLC capacity for each core. This dynamic sharing/partitioning potentially
presents an important multicore cache tuning opportunity. Section 3.3.2 reviews techniques that dynamically partition the shared cache capacity between cores or partition
private caches to allow cores to share the cache capacity. Section 3.3.3 summarizes this
section’s contributions.
3.3.1. Efficient Design Space Exploration. Efficient single-core cache design space exploration (Section 3.2.3) provides a fundamental basis for multicore cache design space exploration, but multicore design space exploration introduces additional, complex challenges. In heterogeneous multicore systems, the design space grows exponentially with
the number of cores. Cores without data sharing can leverage single-core cache-tuning
heuristics individually; however, cache tuning should not simply commence on each
core simultaneously. Since cache tuning incurs energy, power, and performance overheads while executing inferior, non-optimal configurations, performing cache tuning on
each core simultaneously may introduce a large accumulated overhead. Additionally,
cores executing data-sharing applications cannot be tuned individually without coordinating the tuning and considering the cache interactions, which have circular tuning
dependencies, wherein tuning one core’s cache affects the behavior of the other cores’
caches. For example, increasing the cache size for one core increases the amount of data
that the cache can store and decreases the miss rate. However, this larger cache may
store more shared data, which increases the number of cache coherence evictions and
forced write backs for all other cores. Even though several previous works optimize an
entire heterogeneous multicore architecture using design space exploration, we limit
our survey to cache tuning techniques only.
Based on Zhang et al.’s single-core configurable cache architecture and cache tuning techniques [2004], Rawlins and Gordon-Ross [2011] developed an efficient design
space exploration technique for level-one data cache tuning in heterogeneous dual-core
architectures, where each data cache could have a different cache size, block size, and
associativity. Due to the cache interactions introduced by shared data, independently
tuning the three parameters for each cache (similarly to previous single-core heuristics)
was not sufficient. Additional adjustments were required based on certain conditions
surmised from empirical observations. The heuristic determined cache configurations
within 1% of the optimal configuration, while searching only 1% of the design space. The
authors extended this heuristic for data cache tuning in heterogeneous architectures
with any number of cores [2012]. In that work, Rawlins and Gordon-Ross classified
the applications based on data sharing and cache behavior and used this classification
to guide cache tuning, which reduced the number of cores that needed to be tuned.
The heuristic searched at most 1% of the design space and determined configurations
within 2% of the optimal configuration.
3.3.2. LLC Partitioning. Shared and private LLCs both have advantages and drawbacks
with respect to the cache capacity utilization efficiency and thread interferences, thus
combining the advantages of both into a hybrid architecture is required. In a private
LLC, the cache capacity is allocated to each core statically. If the application running
on a core is small and the allocated private cache capacity is large, the excess cache
capacity is wasted. Alternatively, if the application running on a core is large and the
allocated cache capacity is small, performance may be compromised due to a large cache
miss rate. In a shared LLC, the cache capacity occupied by each core is flexible and can
change based on the applications’ demands. However, if an application sequentially
processes a large amount of data, such as multimedia streaming reads/writes, the
application will occupy a large portion of the cache with data that is only accessed
a single time; thus a large cache capacity may not reduce the cache miss rate for
this application. As a result, the coscheduled applications running on the other cores
experience high cache miss rates due to shared cache contention. However, in a shared
LLC, the capacity can be dynamically partitioned, and each partition is allocated to
one core as the core’s private cache, and this private cache’s size and associativity can
be configured. Alternatively, private LLCs can be partitioned to be partially or entirely
shared with other cores.
Cache partitioning consists of three components: the cache partitioning controller,
system metric tracking, and the cache partitioning decision. These components have functionalities similar
to the components in online cache tuning for single-core architectures (Section 3.2) and
can be implemented in the OS software or in hardware.
The cache partitioning controller can be implemented using a modified replacement
policy in a shared LLC or a coherence policy in a private LLC. By placing replacement constraints on the candidate replacement blocks in a cache set and the insertion
location for each core’s incoming blocks, an individual core’s cache occupancy can be
controlled. The cache can also be directly partitioned by the OS using OS-controlled
page placement.
System metric tracking requires tracking metrics for the entire system or each core.
One possible solution is to use dynamic set sampling [Qureshi and Patt 2006]. Dynamic
set sampling samples a group of cache sets and maintains the sets’ tags as if the
sampled sets were configured with one potential configuration in the design space
and/or as if the sampled sets were entirely allocated to one core. The sampled sets are
used to approximate the global performance of the entire cache. With multiple groups
of sampled sets such that each group represents one possible configuration, the best
configuration can be determined directly. Since the number of sampled sets increases
exponentially with the number of cores, most previous works designed novel techniques
to reduce the hardware overhead. Another system metric tracking technique that is
similar to single-core cache tuning monitors the cache access profile in hardware or
software and analyzes each core’s cache requirements and interactions with other cores.
Previous works present two common cache partitioning decision optimization targets.
One target is to optimize the overall performance, such as the summation/average of
raw/weighted IPCs or cache miss rates. The other target is to maintain performance
fairness across the cores. Hsu et al. [2006] compared the two targets and showed that the cache
partitions could vary greatly between them. However, the correlation
between the two optimization targets and energy consumption has not been investigated, although trading off performance against energy consumption is an interesting research topic. Additionally, combining cache partitioning and cache configuration
(such as deactivating partial cache ways/sets) can enhance the performance and energy optimization. Since determining the optimal solution to resource partitioning is an
NP-hard problem [Rajkumar et al. 1997], greedy and heuristic techniques [Suh et al.
2004; Qureshi and Patt 2006] are used to make a quick cache partitioning decision.
In NUCA architectures where non-uniform latencies exist in accessing different ways,
proximity-aware cache partitioning [Liu et al. 2004; Yeh and Reinman 2005; Huh et al.
2007; Dybdahl and Stenstrom 2007] is preferred to reduce hit latency.
In a distributed shared LLC, either an entire way or set is located in one cache
bank, allowing the cache to be partitioned either on a way or set basis. In this survey,
we do not review shared LLC bandwidth partitioning, although limited shared LLC
bandwidth can be a performance and energy bottleneck.
Way-Partitioning. In way-partitioning, the LLC is partitioned using block
replacement management, and the way partitions are assigned to the cores either
physically or logically to ensure each core occupies no more than the core’s quota.
Physical way-partitioning is on a way-granularity and uses ways as partition boundaries. The physical way-partitioning controller enforces cache blocks from each core
to reside in fixed physical ways. Even if an application running on one core temporarily does not
utilize the application’s assigned quota, the other cores cannot leverage this
vacant cache capacity. Logical way-partitioning is on a block-granularity and logically
allocates blocks to each core. Partitioning boundaries are not strictly enforced and a
core can utilize another core’s way quota if the other core has vacant cache capacity.
The logical way-partitioning controller is implemented using a modified replacement
policy. Since way-granularity is coarser than block-granularity, way-granularity offers
a smaller design space than block-granularity.
Since physical way-partitioning restricts cores from replacing another core’s blocks,
single-accessed blocks may pollute the cache for a long period of time. Logical way-partitioning alleviates this problem using a specially-designed replacement policy
[Jaleel et al. 2008b; Xie and Loh 2009], which achieves better cache utilization as
compared to physical way-partitioning. However, the cache partitioning control and
block lookup in physical way-partitioning are usually less complex than in logical way-partitioning. For example, a common physical way-partitioning implementation is column caching [Chiou et al. 2000], where each core is associated with a bit vector in
which the bits specify the replacement candidates. This bit vector does not introduce
a cache lookup penalty since the partitioned cache access is precisely the same as the
cache access in a conventional cache. In logical way-partitioning, block replacement
candidates may belong to other cores, thus determining the replacement candidate is
not as straightforward as in way-granularity cache partitioning.
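The bit-vector mechanism described above can be illustrated with a minimal Python sketch of a single cache set. The class name, the `way_masks` parameter, and the timestamp-based LRU are illustrative assumptions and are not taken from Chiou et al. [2000]. Lookup searches every way, as in a conventional cache; only victim selection is restricted to the ways whose bit is set in the requesting core's mask.

```python
class WayPartitionedSet:
    """One cache set whose replacement candidates are limited per core by a bit vector."""

    def __init__(self, num_ways, way_masks):
        self.tags = [None] * num_ways     # tag stored in each way
        self.last_use = [0] * num_ways    # timestamp of last access, used for LRU
        self.way_masks = way_masks        # core id -> bit vector of replaceable ways
        self.clock = 0

    def access(self, core, tag):
        """Return True on a hit, False on a miss (the block is then filled)."""
        self.clock += 1
        # Lookup is identical to a conventional cache: every way is searched.
        for way, stored in enumerate(self.tags):
            if stored == tag:
                self.last_use[way] = self.clock
                return True
        # Miss: only ways whose bit is set in the core's mask may be replaced.
        # Assumption: every core's mask has at least one bit set.
        mask = self.way_masks[core]
        candidates = [w for w in range(len(self.tags)) if (mask >> w) & 1]
        victim = min(candidates, key=lambda w: self.last_use[w])
        self.tags[victim] = tag
        self.last_use[victim] = self.clock
        return False
```

With masks `{0: 0b0011, 1: 0b1100}` on a 4-way set, core 0 can only ever evict blocks in ways 0 and 1, so its misses never displace core 1's blocks.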
Physical Way-Partitioning. Qureshi and Patt [2006] developed Utility-based Cache
Partitioning (UCP). A monitor tracked the cache misses for every possible number of ways
for each core using dynamic set sampling. The monitor maintained the tags of only a
few sampled sets, as if these sets were entirely allocated to one core. By associating
a hit counter with each way in the sampled sets, the marginal utility (reduction in
cache misses by having one more way) of each way for every core was calculated, and
the partitioning with the minimal total number of cache misses was selected. Greedy
and refined heuristics were used to determine the cache ways’ assignments to the
cores. Iyer [2004] classified the applications into different priorities according to the
application’s degree of locality and latency sensitivities. A single LLC was partitioned
on a set/way basis and could be divided into multiple small heterogeneous caches,
whose organization and policies were different. The OS allocated cache capacity to
each application based on the monitored performance and the application’s priority.
Varadarajan et al. [2006] divided the cache into small direct-mapped cache units,
which were assigned to each core for exclusive usage. In each cache partition, the
cache size, block size, and associativity were variable. Kim et al. [2004] developed cache
partitioning for fairness optimization using static and dynamic techniques. Static cache
partitioning utilized the stack distance profile of cache accesses to determine the cores’
requirements. Dynamic cache partitioning was employed for each time interval. By
monitoring cache misses, a core’s cache partition was increased or decreased as long
as better fairness could be obtained. If there was no performance benefit after one
time interval with the newly partitioned results, a rollback was invoked to revert to
the previous partitioning decision.
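The marginal-utility idea behind UCP can be sketched as a plain greedy allocation in Python. This is only an illustration under stated assumptions: the function name, the input layout (`hit_counters[c][i]` holds the hits that core `c`'s sampled sets recorded at LRU stack position `i`), and the rule that every core keeps at least one way are hypothetical, and the sketch does not reproduce the refined heuristic of Qureshi and Patt [2006].

```python
def greedy_way_partition(hit_counters, total_ways):
    """Assign ways to cores one at a time by marginal utility.

    hit_counters[c][i] is the number of hits seen at LRU stack position i
    in core c's sampled sets, i.e. the misses avoided by giving core c
    its (i+1)-th way.
    """
    num_cores = len(hit_counters)
    alloc = [1] * num_cores  # assumption: every core keeps at least one way

    def gain(core):
        # Marginal utility of granting this core one more way.
        nxt = alloc[core]
        return hit_counters[core][nxt] if nxt < len(hit_counters[core]) else 0

    for _ in range(total_ways - num_cores):
        best = max(range(num_cores), key=gain)
        alloc[best] += 1
    return alloc
```

For example, with counters `[[100, 50, 10, 5], [80, 70, 60, 2]]` and four ways, the second core receives three ways because its second and third ways each avoid more misses than the first core's second way would.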
Each of these cache partitioning techniques was proposed to partition a shared
LLC. Lee et al. [2011] developed CloudCache, an LLC physically composed of private caches, in which the virtual private cache capacity for each core could be
increased/decreased using the physical private cache banks and the neighboring cores’
cache banks. Similarly to Qureshi and Patt [2006], the system metric tracking used
dynamic set sampling to track the cache misses with a different number of ways for
each core. A proximity-aware placement strategy was used such that the closest local
bank stored the MRU block, and the farthest remote banks stored the LRU block. In
contrast to partitioning a shared LLC to take advantage of a private cache’s features,
MorphCache [Srikantaiah et al. 2011] merged private caches to leverage shared cache
advantages. MorphCache dynamically merged private level-two and private level-three
caches and split the merged caches to form a configurable cache organization. The private caches could remain private or be merged as shared caches amongst 2, 4, 8, or
16 cores. If a core overutilized its private cache and another core underutilized its private cache, the two caches were merged to obtain a moderate utilization. In addition,
if two cores had a large amount of shared addresses, merging the two private caches as
one shared cache could reduce the data replication and coherence overhead. Huh et al.
[2007] partitioned the cache evenly into several partitions, where each partition was
shared by any number of cores. Coherence was maintained using a directory located at
the center of the chip, and the allocation was data proximity-aware. Liu et al. [2004] and
Dybdahl and Stenstrom [2007] also investigated proximity-aware cache partitioning.
Logical Way-Partitioning. In logical way-partitioning, the number of blocks used by
individual cores or shared by multiple cores in different cache sets may be different
and change in accordance with real-time accesses from different cores. Logical way-partitioning has a fine block-granularity.
Suh et al.’s [2004] cache partitioning technique used marginal utility estimation for
each core. Similarly to Qureshi and Patt [2006], dynamic set sampling tracked each
core’s LRU information to calculate the reduction in cache miss rate by adding one
more partition unit (cache way or block). A greedy algorithm determined the best cache
partition by identifying the core that received the greatest cache miss rate improvement
by adding one more partition unit and the core that suffered the least cache miss
rate degradation by removing one partition unit. If the improvement outweighed the
degradation, the core that received the most benefit was allocated one more partition
unit, and the core that suffered the least degradation gave up one partition unit.
Instead of simply partitioning cache ways similarly to column caching [Chiou et al.
2000], Suh’s cache partitioning controller modified the LRU replacement policy. If the
core’s actual cache occupancy was larger than the core’s block quota, the core’s LRU
block was replaced. Otherwise, the core could replace another core’s LRU block if the
other core over-occupied its quota.
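The quota-aware victim selection described above can be sketched in Python as follows. The function name, the representation of a set as a list of owner core ids ordered from MRU to LRU, and the fallback to plain LRU when no block in the set matches the rule are assumptions made for illustration, not details from Suh et al. [2004].

```python
def choose_victim(lru_stack, requester, occupancy, quota):
    """Pick the index of the block to replace in one cache set.

    lru_stack: owner core id of each block, ordered from MRU (index 0) to LRU.
    occupancy, quota: per-core block counts for the whole cache.
    """
    def lru_block_of(pred):
        # Scan from the LRU end for the first block whose owner satisfies pred.
        for idx in range(len(lru_stack) - 1, -1, -1):
            if pred(lru_stack[idx]):
                return idx
        return None

    if occupancy[requester] > quota[requester]:
        # Requester exceeds its quota: replace its own LRU block.
        victim = lru_block_of(lambda owner: owner == requester)
    else:
        # Requester is within quota: take the LRU block of an over-quota core.
        victim = lru_block_of(
            lambda owner: owner != requester and occupancy[owner] > quota[owner]
        )
    # Fallback (assumption): plain LRU if no block in this set matches the rule.
    return victim if victim is not None else len(lru_stack) - 1
```

Because the check uses whole-cache occupancy rather than per-set way counts, the number of blocks a core holds can differ from set to set, which is what gives this scheme its block granularity.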
Managing cache insertion enables implicit, pseudo-partitioning of a shared cache.
Jaleel et al. [2008b] logically partitioned a shared LLC by selecting the insertion position for incoming blocks from each core such that the lifetime of the blocks existing
in the cache was controlled by the insertion policy. This technique was based on the
dynamic insertion policy (DIP) developed by Qureshi et al. [2007]. In a conventional
LRU replacement policy, the new block is always inserted into the MRU position, which
is referred to as an MRU insertion policy (MIP). DIP uses both MIP and BIP (bimodal
insertion policy). In BIP, incoming blocks are inserted into the MRU position with a
small probability, and the majority of the blocks are inserted into the LRU position and
have a short cache lifetime. The number of cache misses for each policy was monitored
using dynamic set sampling. Although DIP allowed each core to select between two
insertion policies, the number of sampled sets increased exponentially with the number of cores. The authors proposed an approach to independently determine the best
insertion policy for each core and then improved the approach by considering the interference of the insertion policy decision among the cores. Results showed that DIP had
a 1.3X performance improvement as compared to UCP [Qureshi and Patt 2006]. Xie
and Loh [2009] modified both the insertion policy and the rule of updating the MRU
stack in the cache sets. The incoming block could be inserted at an arbitrary position
depending on the core’s allocated cache capacity. For instance, the block was inserted at
a position close to the top of the stack (in a conventional LRU policy, the MRU position
is on the top of the stack) if the core was allocated a high cache capacity. On a cache hit,
the accessed block was moved up one position in the stack with a certain probability
instead of being moved directly to the top of the stack.
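The difference between MRU insertion (MIP) and bimodal insertion (BIP) can be seen in a small Python sketch of one LRU stack. The function names, the `epsilon` parameter, and its default value are assumptions for illustration; the text above only states that BIP inserts at the MRU position with a small probability.

```python
import random


def insert_block(stack, tag, capacity, policy, epsilon=1 / 32, rng=random):
    """Insert tag into an LRU stack (index 0 = MRU, last index = LRU)."""
    if len(stack) >= capacity:
        stack.pop()  # evict the current LRU block
    if policy == "MIP" or rng.random() < epsilon:
        stack.insert(0, tag)  # MRU insertion
    else:
        stack.append(tag)     # LRU insertion: next victim unless re-referenced


def access(stack, tag, capacity, policy, **kw):
    """Return True on a hit; a hit promotes the block to the MRU position."""
    if tag in stack:
        stack.remove(tag)
        stack.insert(0, tag)
        return True
    insert_block(stack, tag, capacity, policy, **kw)
    return False


def count_hits(trace, capacity, policy, **kw):
    """Replay a trace of tags against an initially empty stack."""
    stack = []
    return sum(access(stack, tag, capacity, policy, **kw) for tag in trace)
```

On a cyclic trace whose working set exceeds the capacity, for example six blocks cycling through a four-entry stack, MIP thrashes and never hits, whereas LRU insertion keeps part of the working set resident. Setting `epsilon=0.0` makes the BIP branch deterministic for experimentation.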
In private LLCs, cooperative caching [Chang and Sohi 2006] is a technique that implicitly partitions the cache and attempts to store data in the on-chip cache as much as
possible. Although the caches were private, the equivalent capacity was comparable to
a shared cache. Rather than being discarded on eviction from a core’s private cache, blocks
could be stored in another core’s private cache if that cache had
spare capacity. In addition, in order to reduce data replication across the private LLCs,
blocks that had replicas in remote caches were selected for eviction before the LRU
block. With this eviction control, the aggregate private LLCs behaved similarly
to a shared cache with respect to the efficiency of capacity utilization. By controlling the
probability of retaining evicted blocks in remote caches and the probability of evicting
block replicas, the sharing portion could be adapted to the application’s requirements.
Beckmann et al. [2006] developed a hardware-based technique to dynamically determine the probability of evicting block replicas by keeping track of the number of hits to
the LRU blocks, replicas, and recently evicted blocks to estimate the benefit of increasing/decreasing the replica percentage. Qureshi [2009] proposed dynamic spill-receive
caching for private LLCs, where caches were designated as spiller caches whose evicted
blocks could be stored into another cache or as receiver caches that received the evicted
blocks. To determine if a cache should spill or receive, dynamic set sampling tracked
the cache misses such that in each private cache, one group of sampled sets acted as
always-spill and another group of sampled sets acted as always-receive. The decision to
spill or receive was made based on which group of sampled sets yielded fewer misses.
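The sampled-set decision for spill-receive caching can be sketched in Python as below. The class and method names, the use of two plain miss counters, and the tie-breaking toward "spill" are illustrative assumptions and do not describe the actual hardware proposed by Qureshi [2009].

```python
class SpillReceiveSelector:
    """Chooses spill or receive for one private cache from sampled-set misses."""

    def __init__(self, spill_sets, receive_sets):
        self.spill_sets = set(spill_sets)      # sampled sets that always spill
        self.receive_sets = set(receive_sets)  # sampled sets that always receive
        self.spill_misses = 0
        self.receive_misses = 0

    def record_miss(self, set_index):
        """Count a miss only if it falls in one of the sampled groups."""
        if set_index in self.spill_sets:
            self.spill_misses += 1
        elif set_index in self.receive_sets:
            self.receive_misses += 1

    def policy_for(self, set_index):
        """Sampled sets keep their fixed policy; the rest follow the winner."""
        if set_index in self.spill_sets:
            return "spill"
        if set_index in self.receive_sets:
            return "receive"
        # Follower sets adopt the policy whose sampled group missed less
        # (ties go to "spill", which is an arbitrary assumption).
        return "spill" if self.spill_misses <= self.receive_misses else "receive"
```

Only the two small sampled groups run a fixed policy, so the cost of the experiment is confined to a few sets while the remaining sets follow whichever policy currently misses less.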
Even though some of the logical way-partitioning techniques may not be strictly
classified as way-partitioning, especially for the private LLC’s cooperative caching, we
classify these techniques as way-partitioning due to the use of a modified replacement
policy.
Set-Partitioning. LLC set-partitioning can be implemented using either dedicated hardware or OS-based page placement. OS-based set-partitioning requires little
or no hardware support, thus is easily realizable in modern architectures. The OS statically places pages in cache banks or dynamically changes the page placement by migrating pages across banks with some extra hardware cost. OS-based set-partitioning
is on the OS page-granularity, and the page placement in an NUCA cache can be
proximity-aware by placing pages in the cache bank closest to the requesting core.
Hardware-Based Set-Partitioning. Srikantaiah et al. [2008] allocated the LLC’s sets
across the cores to reduce the cache misses caused by interference. By classifying
the cache misses as intracore misses (the missed block was evicted from the cache
due to fetching another block from the same core) and intercore misses (the missed
block was evicted from the cache due to fetching another block from a different core),
the authors observed that most of the intercore misses were introduced by a few hot
blocks. Therefore, the authors designed set pinning, which prevented the large number
of accesses to the few hot blocks from evicting cache blocks by storing the hot blocks in small
cache regions owned privately by the cores that first accessed the blocks. In addition, a
set in the shared cache was assigned to the core that first accessed the set, and only
the core with ownership of the set could replace blocks in the set. Since this first-touch
allocation policy could result in unfairness, a fairer adaptive policy was adopted such
that cores could relinquish the ownership of sets if non-owner cores yielded a large
number of misses in the set.
OS-Based Set-Partitioning - Page Coloring. In conventional cache architectures,
cache blocks are uniformly distributed across cache banks based on the blocks’ physical addresses in order to improve the cache bandwidth and cache capacity utilization;
however, this block-granularity mapping may not be advantageous for multicore architectures. In multicore architectures, this mapping can significantly increase cache
access latency and network traffic, since accessing contiguous memory blocks traverses
most of the cache banks, and this mapping has no data proximity consideration for requesting cores. If the mapping is on the page granularity, consecutive blocks within
a page can reside in a single bank. OS-based page coloring [Kessler and Hill 1992]
enables full control of page placement in the cache banks using virtual-to-physical
mapping. The placement of a virtual page is dictated by the page’s assigned physical
page address. In an OS-based set-partitioning cache, sets are distributed across banks,
and an entire set resides in one bank. Therefore, in cache addressing, the few most
significant bits in the set index dictate the bank, and the remaining bits dictate the set
location in the bank. In page coloring, each bank is associated with one color and a free
list keeps track of the available pages in each bank. When a new page is required, the
OS determines the optimal bank for the page and allocates a free page from the list by
creating a virtual-to-physical mapping.
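The addressing and free-list mechanism described above can be sketched in Python. All bit widths (4 KiB pages, 64-byte blocks, 1024 sets, 4 banks) and all names are assumptions chosen for illustration. They are picked so that the bank-selecting bits of the set index lie above the page offset, which is what lets the OS choose the bank through its choice of physical page.

```python
from collections import deque

PAGE_BITS = 12       # 4 KiB pages (assumption)
BLOCK_BITS = 6       # 64-byte cache blocks (assumption)
SET_INDEX_BITS = 10  # 1024 sets in total (assumption)
BANK_BITS = 2        # 4 banks, hence 4 colors (assumption)


def bank_of(phys_addr):
    """Bank (color) = the most significant BANK_BITS of the set index."""
    set_index = (phys_addr >> BLOCK_BITS) & ((1 << SET_INDEX_BITS) - 1)
    return set_index >> (SET_INDEX_BITS - BANK_BITS)


class ColoredAllocator:
    """Keeps one free list of physical pages per color (bank)."""

    def __init__(self, num_phys_pages):
        self.free = {color: deque() for color in range(1 << BANK_BITS)}
        for ppn in range(num_phys_pages):
            self.free[bank_of(ppn << PAGE_BITS)].append(ppn)
        self.page_table = {}  # virtual page number -> physical page number

    def map_page(self, vpn, preferred_bank):
        """Map vpn to a free physical page that resides in preferred_bank."""
        if not self.free[preferred_bank]:
            raise MemoryError("no free page of this color")
        ppn = self.free[preferred_bank].popleft()
        self.page_table[vpn] = ppn
        return ppn
```

With these widths the bank is selected by address bits 14 and 15, which belong to the physical page number, so every block of a page maps to the same bank. A real OS would also need a policy for the case where the preferred color has no free pages, which this sketch simply reports as an error.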
Cho and Jin [2006] proposed OS-based set-partitioning for shared LLCs. In multithreaded applications, where the accessed data in each core is mostly private, when
a new page was requested, the page was directly allocated to the requesting core’s
closest bank and served as a private cache, thus providing proximity-aware caching.
For multithreaded applications, where a page was shared by multiple cores, the shared
page was allocated as a single instance in one bank, and the OS determined the optimal proximity bank based on the average distance to the page’s requesting cores. If
the local bank was too small to store the owner core’s application, subsequent page
allocations for that core could borrow cache capacity from neighboring cache banks.
Similarly, if there were many actively accessed pages in one bank, which meant the
bank was suffering heavy cache pressure, spreading pages to neighboring cache banks
was necessary.
In the first-touch page allocation, the page placement does not change dynamically
during a phase. When a phase change is detected (Section 3.4), especially when the application running on one core migrates to another core or the page sharing changes dramatically, page remapping is required. However, migrating pages requires copying the
pages to main memory, which introduces high performance and energy costs. Awasthi
et al. [2009] developed a technique to eliminate high page migration costs by preserving the page’s original physical memory location. Since accessing the pages should use
new addresses, the TLB did not directly map the virtual address to the original physical address and instead produced the new address using a translation table. When
page migration occurred due to remapping, first the page in the cache was flushed and
then re-loaded into the cache with the page’s new addresses. To reduce the large space
required by the translation table, Lin et al. [2009] used a coarser granularity that
remapped an entire cache region consisting of multiple physical pages. Hardavellas
et al. [2009] used the OS to classify the pages into three categories: shared instruction
pages that were replicated in each core’s local banks without any coherence mechanism,
private data pages that were directly placed in the local cache banks of requesting
cores, and shared data pages that were kept as one instance in a single bank to avoid
coherence overhead, with multiple such pages distributed equally across all banks.
3.3.3. Summary. Data address sharing and shared resource contention in multicore
architectures present numerous cache tuning challenges. In this section, we reviewed
efficient design space exploration techniques for heterogeneous first-level private
caches with varying cache size, block size, and associativity. Since shared data interference and coherence in multicore architectures complicate cache tuning, cache tuning
heuristics for these architectures must also consider tuning the cache organization,
designating sharing and private partitions in the last-level cache, and proximity-aware
partitioning and allocation of cache partitions to each core in NUCA caches to maximize
performance and minimize energy consumption. We reviewed tuning techniques at the
cache-partitioning level, including both explicit and implicit partitioning techniques,
such as modified replacement and insertion policies for shared caches and cooperative
caching for private caches. Unlike conventional cache partitioning techniques, where
each partitioned portion is exclusively used by one core, the reviewed cache partitioning
techniques allowed partitions to be optionally shared by any number of cores. However,
although cache optimization for multicore architectures has been investigated, cache
tuning that combines last-level cache partitioning and first-level cache configuration
is still an open research topic. Additionally, cache tuning for SMT processors while
considering other resource contentions has not been researched thoroughly.
Table III summarizes the cache partitioning techniques reviewed in Section 3.3.2.
The first column lists the target cache architectures to be partitioned, and the second,
third, and fourth columns give the partitioning techniques, the benefits, and partition
granularity, respectively, for each architecture. Although we reviewed the techniques
with respect to way- and set-partitioning, we summarize the techniques based on a
different aspect. Since the main objective of cache partitioning is to strengthen the
private and shared cache organizations to achieve the collective benefits of both types
of cache organizations, we summarize the techniques with respect to shared and private
LLCs, with OS-based page coloring summarized separately.
3.4. Phase-Change Detection
An application phase is defined as the set of intervals within an application’s execution
that have similar cache behavior [Sherwood et al. 2003b]. In order to adapt the cache
configuration to different application phases during runtime, online cache tuning must
detect and react to phase changes to determine the new optimal cache configuration for
the next phase. Accurately detecting the precise phase change moment is both critical
and difficult [Gordon-Ross and Vahid 2007]. If the phase-change detection mechanism
is not sensitive enough and the next phase’s optimal cache configuration provides significant energy savings, missing the phase-change will waste energy because the system
will execute in a suboptimal configuration. Alternatively, if the phase-change detection
is too sensitive, the overhead due to too frequent cache tuning may consume more
energy than simply running in a fixed base cache configuration for the application’s
entire execution.
Table III. Summary of Cache Partitioning Techniques

Shared LLC
  Technique: Allocate physical ways/sets to each core.
  Benefits:
    1. Isolated capacity for each core avoids shared cache contention.
    2. Adjustable “private” cache.
  Technique: Modified replacement and insertion policies.
  Benefits:
    1. Better utilization of cache capacity by evicting unused blocks that belong to other cores.
    2. Occasional long lifetime of some blocks helps reduce cache misses (since in some techniques, each incoming block’s insertion location is nondeterministic, which is dictated with a probability).

Private LLC
  Technique: Physically shrink/expand private caches.
  Benefits:
    1. Adjust private cache size/associativity by borrowing from other cores’ cache.
  Technique: Physically merge several private caches into a shared cache.
  Benefits:
    1. Reduces shared address replication.
    2. Avoids coherence overhead.
  Technique: Cooperative caching: retain evicted blocks in other cores’ caches; replace the block with replicas in other cores’ caches instead of replacing the LRU block.
  Benefits:
    1. Reduces shared address replication.
    2. Better cache capacity utilization.

OS-based page coloring
  Techniques: Allocate private pages to the requesting core’s closest banks. Replicate read-only shared pages to the requesting core’s closest cache banks. Shared data pages are stored as one instance in a single bank that is close to all the requesting cores.
  Benefits:
    1. Data proximity-aware to reduce hit latency.
    2. Isolated capacity for each core avoids shared cache contention.
    3. Better cache capacity utilization.
    4. Reduces coherence overhead.
    5. No hardware complexity overhead but requires OS modification.
  Partition granularity: OS page

Phase-change detection can be performed either online or offline. Online phase-change detection has been the focus of much online algorithm research [Dhodapkar
and Smith 2002; Gordon-Ross and Vahid 2007; Huang et al. 2008; Huang et al. 2003;
Sherwood et al. 2003a]. Online algorithms process input as the input arrives over
time, thus online algorithms do not have knowledge of all of the inputs at the start of
execution. Online phase change detection predicts future phase changes and phase durations (the length of time between phase changes) based on current and past system
behavior. Some configurable cache architectures have integrated hardware for online
phase change detection. The phase change detection hardware monitors system behavior and performance and initiates cache tuning based on the changes in the observed
behavior. Since cache tuning is triggered after a variation in system behavior and performance is observed, online phase change detection is a reactive technique. During
the reaction time (the period of time before the system metrics reflect the behavior
change), the cache configuration remains the same as the cache configuration from the
previous phase, which wastes energy. If an application’s behavior has significant variability, reactive phase-change detection will suffer from continuously lagging behind
the phase change. Thus, the optimal system is an oracle that immediately recognizes
phase changes and then triggers the cache tuning. Alternatively, offline phase change
detection is a proactive technique. Designers use offline phase change detection analysis tools to determine the phase changes (typically denoted by particular instructions)
and then provide these predetermined phase changes to the cache-tuning hardware
to invoke cache tuning. Designers can also leverage offline cache-tuning techniques to
statically determine the optimal cache configuration for each predetermined phase and
provide these predetermined configurations to the cache-tuning hardware, which eliminates runtime design space exploration. One possible implementation is to leverage
a compiler or linker to insert special instructions into the application binary at each
phase change to automatically update the cache configuration registers and reconfigure
the cache.
Application execution is analyzed for a phase change at either a fixed or variable
interval. During each interval, the phase analysis technique collects system metrics
and analyzes these metrics at the end of each interval to determine if a phase change
has occurred. Previous works in fixed-interval analysis explored intervals ranging
from 100,000 [Balasubramonian et al. 2000] to 10 million instructions [Sherwood et al.
2003a, 2003b] or from 10 milliseconds to 10 seconds [Duesterwald et al. 2003]. Finding
the best interval length for an application with particular input stimuli is challenging
except in situations where the interval length is not critical, such as long and stable
phases. Some phase-change detection techniques [Huang et al. 2003; Shen et al. 2004;
Lau et al. 2006] do not use fixed intervals, but instead react directly to variations in
system metrics or application structure.
Previous work revealed that applications tended to revisit phases [Sherwood et al.
2003b], which resulted in a technique called phase classification. Phase classification
classifies intervals based on the interval’s system metrics and same-class intervals have
the same optimal cache configuration. Based on the execution order of past phases, future intervals can be predicted as belonging to a particular phase class using Markov
[Sherwood et al. 2001] or table-driven predictors [Dropsho et al. 2002a; Duesterwald
et al. 2003; Sherwood et al. 2003a]. Both phase classification and phase prediction enable predetermination of the optimal cache configuration for future intervals, thereby
eliminating cache tuning for same-class phases.
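The prediction step can be illustrated with a minimal sketch. It assumes that each interval has already been classified into an integer phase ID and uses a first-order Markov table that predicts the most frequent successor of the current phase. The class name MarkovPhasePredictor and the dictionary-based table are illustrative assumptions, not the hardware organization used in the cited works.

```python
from collections import Counter, defaultdict


class MarkovPhasePredictor:
    """First-order Markov predictor over phase IDs (illustrative sketch)."""

    def __init__(self):
        # transitions[p][q] counts how often phase q followed phase p.
        self.transitions = defaultdict(Counter)
        self.last_phase = None

    def observe(self, phase_id):
        """Record the phase ID of the interval that just finished."""
        if self.last_phase is not None:
            self.transitions[self.last_phase][phase_id] += 1
        self.last_phase = phase_id

    def predict(self):
        """Predict the next interval's phase ID.

        Falls back to the last observed phase when the current phase
        has no recorded successor yet.
        """
        successors = self.transitions.get(self.last_phase)
        if not successors:
            return self.last_phase
        return successors.most_common(1)[0][0]


predictor = MarkovPhasePredictor()
for phase in [0, 0, 1, 0, 0, 1, 0, 0]:
    predictor.observe(phase)
next_phase = predictor.predict()
```

A predicted phase ID could then index a table of previously determined cache configurations, so that a revisited phase does not require a new design space exploration.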
The typical system metrics used to detect phase changes include system performance
(e.g., cache misses, cache access delays, IPC (instructions per cycle), and system power),
application structures (e.g., number of branches, looping patterns, number of instructions within a loop, working set analysis counters), and memory accesses. System
performance-based phase-change detection leverages a system performance monitor to
track any performance degradation incurred by using the current cache configuration
to determine whether a new cache configuration is required. This detection technique
is generally used for online phase-change detection using custom-designed hardware
for runtime system performance monitoring. Additionally, this detection technique can
also be performed offline using an offline profiler for system performance analysis. Application structure-based phase-change detection marks a subset of loops and functions
as basic elements, and divides the elements into phases according to statistical characteristics. This detection technique is more complicated than system performance-based
phase-change detection due to the typically large number of complex loops and functions.
However, unlike system performance-based phase-change detection that depends on
detecting system metric changes, application structure-based phase-change detection
depends on the variation in the application code, which potentially produces input
set-, system architecture-, and optimization objective-independent phases [Lau et al.
2006]. Application structure-based phase-change detection can be used both online and
offline. Memory access-based phase-change detection leverages the reuse distance of
memory accesses to detect phase changes. Since the cache behavior directly depends on
the memory access pattern, memory access-based phase-change detection is particularly effective for cache tuning, but can only be used for offline phase-change detection
due to complicated memory access analysis.
In the following sections, we present previous works in online and offline phase-change detection for single-threaded applications and multithreaded applications.
Fig. 6. Overview of phase-change detection (figure labels: current metric, history table, system execution, phase change).
3.4.1. Online Phase-Change Detection. Online phase-change detection is autonomous
and transparent to designers. Phase-change detection is designed to dynamically detect phase changes and trigger cache tuning. Figure 6 depicts an overview of one
possible hardware solution for phase-change detection. During system execution, a
system metric, which records system performance or application structure characteristics, is accumulated during a fixed or variable interval. The accumulator consists of
one or more system metric counters (shown as buckets divided by dashed lines in the
figure). The number of counters depends on the dimensionality used to represent the
system metric. A history table maintains past system metric history from all previous
intervals [Dhodapkar and Smith 2002; Sherwood et al. 2003a] or from only the last
interval [Balasubramonian et al. 2000]. The system metric accumulated for each interval or phase experienced previously is recorded in one column of the history table.
The phase-change detection unit compares the current system metric with the system
metric’s history to predict phase changes [Balasubramonian et al. 2000] or perform
phase classification [Dhodapkar and Smith 2002; Sherwood et al. 2003a, 2001] for future intervals. A system metric variation threshold is set to tolerate intra-phase system
metric fluctuations to avoid triggering cache tuning too frequently (i.e., no system metric remains precisely constant within a phase, and thus intraphase fluctuations should
not trigger cache tuning). When a phase change is detected, the cache-tuning hardware
is triggered and tunes the cache based on the collected system metric or phase classification results. Some simpler phase-change detectors [Dropsho et al. 2002a; Powell
et al. 2001] do not use a system metric history table and instead compare the current
system metric with a predetermined bound, and cache tuning is triggered when the
system metric exceeds the bound.
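The interaction between the accumulator, the history table, and the variation threshold can be sketched in software as follows. This is a simplified model under stated assumptions: the per-interval metric is a vector of counter values, vectors are normalized and compared with the Manhattan distance, and the history table keeps one entry per phase seen so far. The names PhaseDetector and end_of_interval are hypothetical.

```python
def manhattan(a, b):
    """Sum of absolute differences between two equal-length metric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


class PhaseDetector:
    """History-table phase classifier (illustrative sketch)."""

    def __init__(self, threshold):
        self.threshold = threshold  # tolerated intra-phase fluctuation
        self.history = []           # one stored metric vector per known phase
        self.current_phase = None

    def end_of_interval(self, metric):
        """Classify the finished interval; return (phase_id, phase_changed)."""
        total = sum(metric) or 1
        normalized = [m / total for m in metric]
        # Find the closest previously seen phase, if any.
        best_id, best_dist = None, None
        for phase_id, stored in enumerate(self.history):
            dist = manhattan(normalized, stored)
            if best_dist is None or dist < best_dist:
                best_id, best_dist = phase_id, dist
        if best_dist is None or best_dist > self.threshold:
            # No stored phase is similar enough: allocate a new table entry.
            self.history.append(normalized)
            best_id = len(self.history) - 1
        changed = best_id != self.current_phase
        self.current_phase = best_id
        return best_id, changed


detector = PhaseDetector(threshold=0.2)
intervals = [[90, 10], [88, 12], [20, 80], [91, 9]]
results = [detector.end_of_interval(m) for m in intervals]
```

In this sketch the second interval stays in the first phase because its variation is below the threshold, the third interval opens a new phase, and the fourth interval is recognized as a return to the first phase. The very first interval is also reported as a change, since no phase is known before it.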
Several cache architectures for online cache tuning discussed in Sections 3.2 and
3.3 have integrated online phase-change detection, which we will elaborate on in
the remainder of this section. Additionally, we will review recently proposed system
performance- and application structure-based phase change detection techniques.
System Performance-Based Phase-Change Detection. In Powell’s DRI cache [Powell
et al. 2001], the cache miss rate was compared to a cache miss rate bound at a fixed
interval to detect phase changes and resize the cache. If the cache miss rate exceeded
the cache miss rate bound, additional cache partitions were activated, and if the cache
miss rate fell below the cache miss rate bound, cache partitions were deactivated.
Dropsho et al. [2002a] calculated and maintained delay counters that recorded the
delays in accessing the partitioned primary and secondary ways of the cache, which
indicated phase changes. The delay tolerance threshold was fixed and predetermined.
Balasubramonian et al. [2000] detected phase changes using hardware counters to collect cache miss rates, IPC, and branch frequencies for each fixed interval. Successive
intervals were compared by calculating differences in the number of cache misses and
branches. These differences were compared with a threshold to detect a phase change.
If the current interval’s behavior was sufficiently different from the last interval’s
behavior, cache tuning was triggered. Even though this phase-change detection technique could dynamically adjust the threshold, selecting an initial threshold value
that guaranteed convergence was difficult.
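A bound-based detector of the kind described for the DRI cache can be sketched as a simple per-interval controller. The sketch below is an assumption-laden illustration rather than the published design: it assumes power-of-two sizes and a resize step that doubles or halves the active capacity, and the function name resize_cache and all parameter names are hypothetical.

```python
def resize_cache(active_size, miss_rate, miss_bound, min_size, max_size):
    """Return the cache size to use for the next interval.

    Sizes are in bytes and assumed to be powers of two, so resizing
    doubles or halves the active portion of the cache.
    """
    if miss_rate > miss_bound and active_size < max_size:
        return active_size * 2   # too many misses: activate more of the cache
    if miss_rate < miss_bound and active_size > min_size:
        return active_size // 2  # misses are low: deactivate part of the cache
    return active_size


size = 16 * 1024
sizes = []
for miss_rate in [0.08, 0.06, 0.01, 0.01, 0.01]:
    size = resize_cache(size, miss_rate, miss_bound=0.05,
                        min_size=8 * 1024, max_size=64 * 1024)
    sizes.append(size)
```

With the illustrative miss rates above, the cache grows while the miss rate exceeds the bound and then shrinks step by step down to the minimum size once the miss rate falls below the bound.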
Application Structure-Based Phase-Change Detection. Sherwood et al. [2003a] used
basic block vectors (BBVs) to detect and classify phases. An accumulator captured
branch addresses and the number of instructions executed between branches. At the
end of a fixed interval, the branch information was compared with past phases for
phase classification, and the interval’s phase classification was recorded in a past
footprint table. After detecting the phase changes, the next phase’s duration (phases
can span multiple intervals) was predicted using a Markov predictor. In phase-based
cache tuning, each phase’s optimal cache configuration was stored in a table so that
when the phase was reentered, the cache could be configured directly to the previously
recorded optimal cache configuration. Dhodapkar and Smith [2002] proposed hardware
for detecting and classifying phases by tracking instruction working set signatures (i.e.,
a sample of instruction addresses accessed during a fixed interval), and a phase change
occurred when the working set changed. Dhodapkar and Smith [2003] also compared
their technique with the BBV technique [Sherwood et al. 2001] and concluded that
the instruction working set technique was more effective in detecting changes between major phases
(phases whose duration is larger than a preset threshold), while the BBV technique
provided higher sensitivity and more stable phase classification. Since between long
stable phases there is usually a short transition phase that exhibits unique behavior,
Lau et al. [2005] designed phase prediction hardware that identified transition
phases. Duesterwald et al. [2003] observed that periodic phase behavior was
shared across metrics and proposed a cross-metric predictor that efficiently integrated
multiple metric predictors to predict phase changes. Huang et al. [2003] proposed
stack-based hardware to identify application subroutines by tracking procedure calls.
A function-call stack tracked subroutine entries and returns to indicate the execution
time spent on each subroutine. If the execution time spent in a subroutine was greater
than a threshold value, the subroutine was identified as a major phase.
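Comparing working set signatures can be illustrated with a small sketch. It assumes a signature is the set of hashed cache-line indices touched during an interval, and that two signatures are compared by the fraction of their combined elements that they do not share. The modulo hash, the default sizes, and the 0.5 threshold are illustrative assumptions.

```python
def signature(addresses, num_bits=1024, line_bytes=64):
    """Lossy working-set signature: the set of hashed line indices touched."""
    return {(addr // line_bytes) % num_bits for addr in addresses}


def relative_distance(sig_a, sig_b):
    """Fraction of the combined working set that the signatures do not share."""
    union = sig_a | sig_b
    if not union:
        return 0.0
    return len(sig_a ^ sig_b) / len(union)


def phase_changed(sig_prev, sig_curr, threshold=0.5):
    """Report a phase change when consecutive signatures differ too much."""
    return relative_distance(sig_prev, sig_curr) > threshold


loop_a = signature(range(0x1000, 0x1400, 4))        # 16 instruction lines
loop_a_again = signature(range(0x1000, 0x1400, 8))  # same lines, sparser sample
loop_b = signature(range(0x8000, 0x8400, 4))        # disjoint code region
loop_c = signature(range(0x1200, 0x1600, 4))        # overlaps half of loop_a
```

Two intervals that execute the same code produce a distance of zero even when different instructions are sampled, whereas intervals in disjoint code regions produce a distance of one.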
3.4.2. Offline Phase-Change Detection. Reactive techniques, such as online phase-change
detection, are effective when applications exhibit stable behavior/phases across several
successive intervals, because the overhead of cache tuning or lagging phase change detection can be amortized over the long, stable phase. However, applications that exhibit
significant variability (frequent phase changes and/or short phases) are not amenable
to reactive techniques, because the phase-change detection process and cache tuning
may consume the entire phase, leaving no time to execute in the optimal configuration.
In addition, online phase change detection may require additional hardware overhead,
resulting in increased execution time and energy consumption.
Alternatively, offline phase change detection is a proactive technique, because offline
analysis can analyze the entire application and pinpoint the phase changes such that
cache tuning can be triggered exactly when the phase change occurs. Since offline
phase-change detection acts as an oracle, all future interval behavior is known a priori,
and runtime overheads, such as system metric collection and storage, are reduced and
do not interfere with application execution. However, offline phase-change detection
requires significant design time analysis and is typically based on trace/event profiling,
which may be input-dependent, except for some application structure-based phase-change detection techniques [Lau et al. 2006].
In this section, we present previous works in offline phase-change detection categorized as system performance-, application structure-, and memory access-based.
System Performance-Based Phase-Change Detection. System performance-based
phase change detection is a common online phase change detection technique (Section 3.4.1) that can also be leveraged offline with little modification. Offline performance profilers track performance changes with respect to system metrics, and an
offline analysis tool analyzes these system metrics to detect phase changes similarly
to the online techniques.
Application Structure-Based Phase-Change Detection. Sherwood et al. [2001] evaluated applications’ periodic behavior based on basic block distribution analysis. This
analysis recorded the number of times each basic block was executed and then determined phase changes. In subsequent work, Sherwood et al. [2003b] used BBVs to detect
and classify phase changes offline and chose a single representative interval (an interval
that best represents the phase behavior) for each phase to use as an offline simulation
point for the phase. Additionally, the authors developed an accurate and efficient tool,
SimPoint [Hamerly et al. 2005], which has been widely used in offline phase analysis
[Benitez et al. 2006; Gordon-Ross et al. 2008; Sherwood et al. 2003a; etc.]. Lau et al.
[2004] investigated the trade-offs between using application structures with BBVs,
loop branches, procedures, opcodes, register usage, and memory addresses for phase
classification and concluded that using register usage vectors and loop vectors was as
efficient and effective as using BBVs. By using the application’s procedures and loop
boundaries, Lau et al. [2006] selected software phase markers to signal phase changes
and inserted these markers into the application with a static or dynamic compiler. The
authors also integrated the software phase markers into SimPoint to create variable
length intervals, and generated simulation points that mapped to the application code,
which could be reused across different inputs and compilations for the same application.
Memory Access-Based Phase-Change Detection. Application structure-based phase-change detection can be inaccurate if different iterations of the same loop have significantly different behavior due to different input data/stimuli. Alternatively, phase
change detection based on a memory access trace can increase the phase-change detection accuracy for cache tuning, since the reuse distance of memory accesses directly
dictates the cache miss rates according to the discussion in Section Shen et al.
[2004] used reuse distance patterns to detect phase changes using wavelet filtering.
Ding and Zhong [2003] studied an application’s reuse distance patterns for different
inputs, and Shen et al. [2005] developed phase-based cache miss rate prediction for
different application inputs by building a regression model.
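The link between reuse distance and cache misses can be made concrete with a short sketch. It computes the reuse distance of each access in a block-address trace using an LRU stack and then counts the misses of a fully associative LRU cache, where an access misses if it is a first-time access or if its reuse distance is at least the cache capacity in blocks. The list-based stack is a simple quadratic-time illustration, not one of the efficient algorithms from the cited works, and the function names are hypothetical.

```python
def reuse_distances(trace):
    """Return the reuse distance of every access in a block-address trace.

    The reuse distance is the number of distinct blocks referenced since the
    previous access to the same block; first-time accesses get None.
    """
    stack = []       # most recently used block is at the end
    distances = []
    for block in trace:
        if block in stack:
            position = stack.index(block)
            distances.append(len(stack) - 1 - position)
            stack.pop(position)
        else:
            distances.append(None)
        stack.append(block)
    return distances


def lru_misses(trace, capacity_blocks):
    """Misses of a fully associative LRU cache holding capacity_blocks blocks."""
    return sum(1 for d in reuse_distances(trace)
               if d is None or d >= capacity_blocks)


trace = ["a", "b", "c", "a", "b", "b", "d", "a"]
distances = reuse_distances(trace)
```

Because one pass over the trace yields the miss count for every capacity, a change in the distribution of reuse distances between intervals signals a change in cache behavior regardless of the currently configured cache size.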
3.4.3. Phase Change Detection for Multithreaded/Multicore Architectures. In SMT and CMP
architectures, overall-phase (a phase in a multithreaded/multicore application’s overall
execution) detection and classification are more complex than in single-threaded single-core processors, since changing the cache parameters may change the coscheduling of
applications, leading to different overall cache behavior. Furthermore, in real systems,
it is unlikely that two applications will begin execution at the same time due to different
scheduling times for each application. Thus, the coscheduled portions of concurrent
applications will dynamically change with the OS scheduling during runtime.
Previous works on overall-phase change detection and classification for SMT and
CMP architectures are based on the classic BBV technique [Sherwood et al. 2003b] for
single-cores. Biesbrouck et al. [2004] designed a cophase matrix for SMT architectures.
ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013.
The authors stated that since the single-threaded phase behavior was still valid in
multithreaded execution with shared resource contention, the BBV technique could be
directly leveraged for phase analysis for each single-threaded application. If any thread
contained a phase change, the overall-phase changed. The authors used a cophase matrix to record all combinations of each single-threaded phase’s representative interval
when multiple threads executed simultaneously. The change in execution time for a
single-thread’s phase due to co-execution with other threads was calculated based on
changes in the IPC. Perelman et al. [2006] detected phases for parallel applications
on CMPs by concatenating the interval traces for each thread and determining the
overall-phase behavior across threads using a BBV technique. Using this technique,
the intervals for different threads could be classified as the same overall-phase, and
this overall-phase information was mapped back to the parallel execution. The stalls
introduced by synchronization were considered when forming the fixed-length intervals
such that the interval did not cross a synchronization point. Similarly to Biesbrouck
et al. [2004], a phase change in any single thread indicated an overall-phase change.
However, the number of overall-phases and corresponding representative intervals
were reduced. To further combine the overall-phases, Namkung et al. [2006] classified
the phase combinations recorded in the cophase matrix using the Levenshtein distance to calculate similarity. Kihm and Connors [2005] provided a mathematical model
to correctly weight and accumulate each of the thread phases for overall performance
evaluation by considering the changes in overall-phases with respect to different start
time offsets for the single-thread phases.
The majority of previous works on overall-phase analysis were based on application
structure-based phase change detection (i.e., BBVs [Sherwood et al. 2003b]), because
although shared resource contention impacts the performance and timing-related behavior of single-threads, the application itself is nearly unaffected, except for synchronization stalls and varied phase lengths. Therefore, the single-thread phases determined by application structure-based techniques are likely valid when co-executing
with other threads. However, extending previous works to consider data interferences,
sharing, and coherence between threads is nontrivial.
3.4.4. Summary. Application- and phase-based cache tuning must determine the optimal (or near-optimal) cache configuration quickly and accurately, since the energy overhead incurred during cache tuning can be significant while evaluating suboptimal
configurations. A quick cache tuning time is particularly important for phase-based
cache tuning, since a phase’s execution may be short and cache tuning must complete
quickly enough to amortize the energy overhead incurred during tuning.
In phase-based cache tuning, phase-change detection can be transparently incorporated into configurable cache architectures to trigger cache tuning for different phases.
We reviewed phase-change detection techniques from an online and offline perspective,
wherein the phase changes are detected using three techniques: system performance-,
application structure-, and memory access-based. Table IV summarizes these three
techniques: the first column lists the techniques, the second column gives the characteristics of each technique, and the third and fourth columns indicate whether the
techniques can be used online or offline, respectively.
Table IV. Summary of Phase Change Detection Techniques

Technique | Characteristics | Online | Offline
System performance-based | 1. Monitor system performance to detect phase changes. 2. Input dependence for offline usage. | Yes | Yes
Application structure-based | 1. Analyze application loops and functions to detect phase changes. 2. Generally more complicated than system performance-based. | Yes | Yes
Memory access-based | 1. Use reuse distance of memory accesses to detect phase changes. 2. More effective for cache tuning than application structure-based. 3. Input dependence for offline usage. | No | Yes

For phase-based cache tuning, the phase-change detection technique must identify
phases with an appropriate granularity based on application behavior. If the granularity is too fine, significant overhead can be incurred during redundant cache tuning,
and if the granularity is too coarse, phase changes may not be detected, and overhead
can be incurred while the system executes in suboptimal configurations. Robust phase-change detection techniques can vary the phase granularity during runtime by varying
system metric thresholds, which dictate the similarity between adjacent phases. Even
though most techniques use fixed thresholds, where the threshold is predetermined for
an application using an offline profiler, dynamically adaptive thresholds have been developed
and outperform fixed, predetermined thresholds. However, the initial value selection
for dynamically changed thresholds remains challenging. Hind et al. [2003] provided
an analysis on defining granularity and similarity for a phase and indicated that different granularity and similarity values may produce significantly different phase change
detection results.
The overall-phase change detection and classification for SMT and CMP architectures are mainly based on the techniques for single-threaded single-core architectures.
Since multiple threads can co-execute different application portions at the same time,
the combination of any phase from each single-threaded application may occur as one
overall-phase. Since the number of phase combinations grows exponentially with the
number of threads, previous works mainly focused on reducing the number of overall-phase classes using techniques to globally classify phases across threads and merge
the phase combinations.
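The growth in the number of overall-phases can be illustrated with a brief sketch that enumerates every combination of per-thread phase IDs. The function names are hypothetical, and the phase counts are arbitrary example values rather than measurements.

```python
from itertools import product
from math import prod


def overall_phase_combinations(phases_per_thread):
    """Enumerate every combination of per-thread phase IDs."""
    return list(product(*(range(n) for n in phases_per_thread)))


def overall_phase_count(phases_per_thread):
    """Number of possible overall-phases without enumerating them."""
    return prod(phases_per_thread)


combos = overall_phase_combinations([3, 3])   # two threads, three phases each
eight_threads = overall_phase_count([4] * 8)  # eight threads, four phases each
```

Two threads with three phases each already yield nine possible overall-phases, and eight threads with four phases each yield 4^8 = 65,536, which is why merging similar combinations is necessary.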
The cache tuning techniques reviewed in the previous sections, including offline simulations, analytical modeling, and online cache monitoring, only evaluated cache performance based on statistics, such as the number of cache accesses, cache misses, write
backs, and additional latencies represented by the CPI to determine the cache’s energy
consumption. The cache performance is determined directly from the system architecture and the input datasets. However, since the main goal of cache tuning is to optimize
the cache configuration to minimize the overall system energy consumption, only evaluating the cache performance is insufficient. There exists much research on circuit and
VLSI power and energy consumption analysis; however, our survey does not review
these techniques and instead focuses on architectural-level power/energy quantitative
techniques and widely used estimation tools.
Different cache configurations with varying cache sizes consume different cache leakage power (assuming that the fabrication technique and supply voltage are fixed). In addition, different cache configurations produce diverse dynamic energy consumption for
each cache read/write. Cache misses introduce memory accesses and CPU stalls, whose
power consumption should be considered in addition to the cache configuration effects.
The cache leakage power, cache dynamic energy per read/write, cache latency per access, and the number of cache accesses and misses are used to model the cache-related
energy consumption [Zhang et al. 2004; Dropsho et al. 2002a] for in-order cores. However, for more complicated contemporary cores, which leverage various techniques to
hide cache/memory access latencies (e.g., out-of-order execution, instruction-level parallelism, memory-level parallelism, etc.), a comprehensive energy model is difficult to
construct. These same challenges also arise for multithreaded/multicore architectures.
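For an in-order core, the quantities listed above can be combined into a simple energy estimate. The sketch below is one plausible formulation under stated assumptions, not the exact model from the cited works: dynamic energy is charged per access and per miss, leakage is charged over the total execution time, and every miss stalls the core for the full miss penalty. All function names, parameter names, and numeric values are illustrative placeholders, not measured data.

```python
def total_cycles(base_cycles, misses, miss_penalty):
    """In-order assumption: each miss stalls the core for the full penalty."""
    return base_cycles + misses * miss_penalty


def cache_energy(accesses, misses, cycles, e_access, e_miss, p_leak, cycle_time):
    """Cache-related energy in joules for one cache configuration.

    accesses, misses: event counts from simulation or hardware counters
    cycles:           total execution cycles, including miss stalls
    e_access:         dynamic energy per cache read/write (J)
    e_miss:           extra energy per miss, e.g., memory access and stall (J)
    p_leak:           cache leakage power (W)
    cycle_time:       clock period (s)
    """
    dynamic = accesses * e_access + misses * e_miss
    static = p_leak * cycles * cycle_time
    return dynamic + static


# Two hypothetical configurations running the same 1,000,000 accesses.
small_cycles = total_cycles(2_000_000, misses=50_000, miss_penalty=100)
e_small = cache_energy(1_000_000, 50_000, small_cycles,
                       e_access=0.2e-9, e_miss=10e-9,
                       p_leak=0.01, cycle_time=1e-9)

large_cycles = total_cycles(2_000_000, misses=10_000, miss_penalty=100)
e_large = cache_energy(1_000_000, 10_000, large_cycles,
                       e_access=0.5e-9, e_miss=10e-9,
                       p_leak=0.04, cycle_time=1e-9)
```

With these placeholder values, the larger cache costs more per access and leaks more, yet it consumes less energy overall because it avoids miss energy and shortens execution time. The full-penalty stall assumption is exactly what breaks down for out-of-order and multicore systems.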
Caches are constructed using regular SRAM arrays, thus the power is easily estimated based on the cache size and organization. Cacti [Tarjan et al. 2006] is a widely
used memory hierarchy simulation tool that provides not only the cache leakage power
and dynamic energy per read/write, but also the area and access latency. Cacti is based
on analytical modeling, in which the power/energy is calculated using parameterized
equation models. Wattch [Brooks et al. 2000], used with
SimpleScalar [Austin et al. 2002], provides power consumption for the entire processor. Wattch’s cache modeling is based on Cacti. Wattch only models dynamic
power consumption, which could be compensated by integrating the HotLeakage package [Zhang et al. 2003]. HotSpot [Huang et al. 2006] provides an architecture-level
temperature model based on compact thermal models and stacked layer packaging by
considering the heat flow and electrical phenomena. SimWatch [Chen et al. 2007] integrates Simics [Magnusson et al. 2002] and Wattch in a full-system simulation environment. SimWatch can evaluate the power efficiency of microarchitectures, applications,
compilers, and operating systems. SESC [Renau et al. 2005] models the power for multithreaded processors and the cache hierarchy based on Wattch and Cacti, respectively,
and models the temperature using SESCSpot based on HotSpot. IBM PowerTimer
[Brooks et al. 2003] simulates the power for an entire processor using a simulator based
on empirical techniques. PowerTimer uses the modules’ power consumption in an existing reference processor to predict the power of the desired architectural model by scaling with an appropriate factor. The scaling factor for cache power prediction is dictated by a sophisticated power effect analysis based on the changes in cache parameters and organization.
Cache energy consumption calculation based on simulation tools can be employed
in both offline and online cache tuning. In offline cache tuning, the cache leakage
power and dynamic energy per read/write for all cache configurations are determined
and stored offline. During online cache tuning, the cache performance (accesses and
misses) is monitored and preserved using hardware counters. In order to adapt the
cache to minimize the energy consumption at runtime, the system’s normal execution
may be impeded by the hardware/software cache energy calculation. A direct and simple
technique is to measure the energy consumption for the entire system during runtime
using an on-chip temperature sensor [Sanchez et al. 1997], but only online cache tuning
can leverage this technique.
Cache tuning plays an important role in reducing cache power/energy consumption,
thereby resulting in large total microprocessor system power/energy reduction. We have
surveyed various cache tuning techniques based on two categories, offline static cache
tuning and online dynamic cache tuning, ranging from hardware support to cache tuning techniques, coupled with phase-change detection techniques. The tuning techniques
adjust cache parameters and organization in accordance with changing cache behavior and/or system performance, which is obtained from different approaches, such as
system execution, system simulation, memory access trace simulation/profiling, or application structure analysis. Given complex modern systems, very large design spaces,
and diverse applications, different techniques have their own merits, limitations, and
challenges for different circumstances. Therefore, it is difficult to identify a single,
paramount cache tuning technique.
This survey focuses on cache tuning to minimize energy consumption by varying four
main cache parameters: total size, block size, associativity, and cache organization in
multicore architectures. In addition to continually developing new architectures and
techniques in this area, researchers have extended cache tuning techniques to other
storage components, such as issue queues, register files, TLBs, and buffers. Other topics related to cache tuning include: cache compression, shared cache bandwidth partitioning, cache interconnection network optimization, tuning the cache to trade off the
power/energy minimization and system performance optimization, and DVFS (dynamic
voltage and frequency scaling) to manage system dynamic power. Additionally, emerging techniques in integrated circuit design and three-dimensional (3D) die-stacking
have been proposed to solve the high area overhead problem of SRAM; however, a
review of these related topics is beyond the scope of this survey.
Even though cache tuning has been investigated extensively, many open challenges
still remain, which we summarize as follows.
(1) Even though a large number of simulators have been developed and each simulator offers different coverage, accuracy, and simulation speed, a specially-designed,
accurate, fast cache simulator for an arbitrary cache hierarchy is still needed.
(2) Trace-driven cache simulation is fast, especially using single-pass techniques to
evaluate multiple configurations, thus providing efficient offline cache tuning.
The main challenge is to capture the dynamic and timing-dependent behavior for
simulations of out-of-order processors and multithreaded/multicore architectures.
(3) For cache tuning aimed at minimizing system power/energy consumption, calculating system power/energy consumption based on the cache misses acquired
from offline simulation or online software/hardware profiling is challenging due
to increasing interrelated factors and system complexity.
(4) Although a heuristic search of the design space can dramatically reduce the cache
tuning time, heuristic techniques are difficult to design especially for complex
architectures. Since heuristics are based on empirical experiences, which may
depend on the particular cache hierarchy and organization, designing a versatile
and generalized heuristic for arbitrary cache hierarchies and organization is still
an open research area.
(5) For multilevel and multicore cache tuning, decoupling the cache interference between levels and cores can reduce the cache tuning complexity.
(6) Online cache tuning is a prominent future direction, since online cache tuning
can dynamically react to changing execution and does not require any design
time effort. Existing online cache tuning and phase-change detection techniques
introduce hardware overhead and are intrusive to normal system execution. Reducing the impact of cache tuning and phase change detection on performance
and energy consumption is critical.
(7) Cache tuning considering data proximity, wire-delay, and on-chip network traffic
is necessary for both performance and energy optimizations and is an open topic.
(8) Cache tuning considering additional resource contentions in simultaneous multithreaded (SMT) processors has not been addressed completely.
(9) There is little research in tuning the entire cache subsystem, especially for multicore architectures.
(10) Currently, the major targets of cache partitioning are overall performance improvement and fairness guarantees. Combining cache partitioning and leakage
power reduction techniques (dynamically deactivating portions of the cache using way/bank and block/set shutdown) to obtain both performance and energy
optimizations is an interesting topic.
(11) Integrating operating system page coloring with other cache partitioning techniques, such as modifying the replacement and insertion policies and cooperative
caching, can enhance the performance and energy optimization, which has not
been thoroughly investigated.
(12) Precisely detecting a phase change is challenging, since online phase-change
detection is reactive and proactive offline phase change detection is generally
bounded by the inputs used during offline analysis.
(13) Existing phase-change detection techniques for multithreaded/multicore architectures assume that the phases determined for a single-threaded application are
valid when the application co-executes with other applications. The validation of
this assumption for multicore applications with high intercore dependency and
interaction is necessary.
(14) Emerging techniques developed in integrated circuit design for caches bring new
challenges to cache tuning.
