A Survey on Cache Tuning from a Power/Energy Perspective

WEI ZANG and ANN GORDON-ROSS, University of Florida

Low power and/or energy consumption is a requirement not only in embedded systems that run on batteries or have limited cooling capabilities, but also in desktops and mainframes where chips require costly cooling techniques. Since the cache subsystem is typically the most power/energy-consuming subsystem, caches are good candidates for power/energy optimizations, and therefore, cache tuning techniques are widely researched. This survey focuses on state-of-the-art offline static and online dynamic cache tuning techniques and summarizes the techniques' attributes, major challenges, and potential research trends to inspire novel ideas and future research avenues.

Categories and Subject Descriptors: A.1 [General Literature]: Introductory and Survey; B.3.2 [Memory Structures]: Design Styles—Cache memories; B.3.3 [Memory Structures]: Performance Analysis and Design Aids

General Terms: Design, Algorithms, Performance

Additional Key Words and Phrases: Cache tuning, cache partitioning, cache configuration, power saving, energy saving

ACM Reference Format: Zang, W. and Gordon-Ross, A. 2013. A survey on cache tuning from a power/energy perspective. ACM Comput. Surv. 45, 3, Article 32 (June 2013), 49 pages. DOI: http://dx.doi.org/10.1145/2480741.2480749

This work is supported by the National Science Foundation under grant CNS-0953447. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Authors' addresses: W. Zang and A. Gordon-Ross, Department of Electrical and Computer Engineering, University of Florida, P. O. Box 116200, Gainesville, FL 32611; corresponding author's email: weizang@ufl.edu.

1. INTRODUCTION

In addition to continuously advancing chip technology, today's processors are evolving towards more versatile and powerful designs, including high-speed microprocessors, multicore architectures, and systems-on-a-chip (SoCs). These trends have resulted in rapidly increasing clock frequency and functionality, as well as increasing energy consumption. Unfortunately, high energy consumption exacerbates design concerns by reducing chip reliability, diminishing battery life, and requiring high-cost packaging and cooling techniques. Therefore, low-power processor design is essential for continuing advancements in all computing domains (e.g., personal computers, servers, battery-operated portable devices, etc.).

Among all processor components, the cache and memory subsystem generally consumes a large portion of the total microprocessor system power. For example, the ARM 920T microprocessor's cache subsystem consumes 44% of the total power [Segars 2001]. Even though the Strong ARM SA-110 processor specifically targets low-power applications, the processor's instruction cache consumes 27% of the total power [Montanaro et al. 1997].
The 21164 DEC Alpha's cache subsystem consumes 25% to 30% of the total power [Edmondon et al. 1995]. In real-time signal processing embedded applications, memory traffic between the application-specific integrated circuit (ASIC) and the off-chip memories constitutes 50% to 80% of the total power [Shiue and Chakrabarti 2001]. These power consumption statistics indicate that the cache is an ideal candidate for power/energy reductions. In this survey, we focus on cache power/energy-consumption reduction techniques.

Power is consumed in two fundamental ways: statically and dynamically. Static power consumption is due to leakage current, which is present even when the circuit is not switching, whereas dynamic power consumption is due to the switching activities of capacitive load charging and discharging on transistor gates in the circuit. Since the cache typically occupies a significant fraction of the total on-chip area (e.g., 60% in the Strong ARM [Montanaro et al. 1997]), scaling down the cache size or putting a portion of the cache into a low-leakage mode potentially provides the greatest opportunity for static power reduction. Alternatively, since the internal switching activity due to cache accesses (reads/writes) and main memory accesses (fetching data to the cache) largely dictates dynamic power consumption, reducing the cache miss rate provides the greatest opportunity for dynamic power reduction. In addition, reducing the cache miss rate reduces the application's total number of execution cycles (improved performance), which also reduces the static energy consumption. Therefore, cache optimization techniques that reduce both the cache size and the miss rate are critical in reducing both static and dynamic system power.

The main cache parameters that dictate a cache's energy consumption are the total size, block size, and associativity [Zhang et al. 2004]. Application-specific requirements and behavior dictate the energy consumption with respect to these parameters. Different applications exhibit widely varying cache parameter requirements [Zhang et al. 2004]. For example, the cache size should reflect an application's/phase's working set size. If the cache size is larger than the working set size, excess dynamic energy is consumed by fetching blocks from an excessively large cache, and excess static energy is consumed to power that excessively large cache. Alternatively, if the cache size is smaller than the working set size, excess energy is wasted due to thrashing (portions of the working set are continually swapped into and out of the cache due to capacity misses). Spatial locality dictates the appropriate cache block size. If the block size is too large, fetching unused information from main memory wastes energy, and if the block size is too small, high cache miss rates waste energy due to frequent memory fetching and added stall cycles. Similarly, temporal locality dictates the appropriate cache associativity.
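To make the interplay among these parameters concrete, the following sketch (in Python) derives a cache's geometry from the three main parameters and combines hit/miss counts with per-access, per-miss, and leakage terms into a first-order energy estimate. The function names and energy constants are illustrative placeholders rather than values from any of the cited works.

# First-order cache energy estimate (illustrative sketch; the energy
# constants used in the example are placeholders, not measured values).
import math

def cache_geometry(size_bytes, block_bytes, assoc, addr_bits=32):
    """Derive the number of sets and the offset/index/tag field widths."""
    num_sets = size_bytes // (block_bytes * assoc)
    offset_bits = int(math.log2(block_bytes))
    index_bits = int(math.log2(num_sets))
    return num_sets, offset_bits, index_bits, addr_bits - index_bits - offset_bits

def energy_mj(accesses, misses, exec_seconds, e_access_nj, e_miss_nj, p_static_mw):
    """Dynamic energy from cache and memory accesses plus static (leakage)
    energy accumulated over the execution time, all in millijoules."""
    dynamic = (accesses * e_access_nj + misses * e_miss_nj) * 1e-6  # nJ -> mJ
    static = p_static_mw * exec_seconds                             # mW * s = mJ
    return dynamic + static

# Example: an 8 KB, 2-way cache with 32-byte blocks.
print(cache_geometry(8 * 1024, 32, 2))
print(energy_mj(accesses=1_000_000, misses=20_000, exec_seconds=0.01,
                e_access_nj=0.5, e_miss_nj=50.0, p_static_mw=10.0))

Evaluating the same application's miss counts under different (total size, block size, associativity) triples with such a model exposes the tradeoff described above: a larger cache lowers the miss term but raises both the per-access energy and the static power.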
Adjusting cache parameters on a per-application basis is appropriate for small applications with little dynamic behavior, where a single cache configuration can capture the entire application's temporal and spatial locality behaviors. However, since many larger applications show considerable variation in cache requirements throughout execution [Sherwood et al. 2003b], a single cache configuration cannot capture all temporal and spatial locality behaviors. Phase-specific cache optimization allows the cache parameters and organization to be configured for each execution phase (a phase is a set of intervals within an application's execution that have similar cache behavior [Sherwood et al. 2003b]), resulting in more energy savings than application-based cache tuning [Gordon-Ross et al. 2008].

In multicore cache architectures, the cores' cache parameters collectively dictate the total energy consumption. In heterogeneous multicore architectures, each core can have different cache parameters, whereas in homogeneous multicore architectures, each core has the same cache parameters. In multicore architectures where an application is partitioned across the cores (as opposed to all cores running disjoint applications), the cores' caches directly influence each other, since the cores may have address/data sharing interference and contention for shared cache resources. In addition to the three main cache parameters, the cache organization in multicore architectures also impacts the system energy consumption. The cache organization is dictated by the cache hierarchy, and at each level of the hierarchy, the cores may have shared caches, private caches, and/or a hybrid of shared and private caches.

Cache tuning is the process of determining the best cache configuration (specific cache parameter values and organization) in the design space (the collection of all possible cache configurations) for a particular application (application-based cache tuning) [Gordon-Ross et al. 2004; Zhang et al. 2004] or for an application's phases (phase-based cache tuning) [Gordon-Ross et al. 2008; Sherwood et al. 2003b]. Cache tuning determines the best cache configuration with respect to one or more design metrics, such as minimizing the leakage power, the energy per access, the average cache access time, the total energy consumption, or the number of cache misses, or maximizing performance. Since high energy/power consumption is becoming a key performance-limiting factor, this survey focuses on cache tuning at the architectural level with respect to power/energy savings; however, cache tuning techniques that focus on system performance optimization share similar aspects. Cache tuning techniques estimate the cache performance (e.g., miss rates or access delays) and energy consumption for the configurations in the design space and determine the optimal cache configuration (or the best configuration if determining the optimal is infeasible) with respect to the design metrics. The cache tuning time is the total time required to evaluate all of the cache configurations, or a subset of the cache configurations, in the design space to determine the optimal/best cache configuration.
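As a concrete, simplified picture of application-based cache tuning, the sketch below exhaustively enumerates a small design space and keeps the configuration that minimizes a designer-supplied metric. The evaluate_energy() callback and the parameter lists are hypothetical stand-ins for whatever simulator or analytical model supplies the metric (Sections 2.2.1 and 2.2.2).

# Exhaustive application-based cache tuning over a small design space
# (sketch; evaluate_energy() stands in for a simulator or analytical model
# that returns the chosen design metric for one configuration).
from itertools import product

SIZES_KB      = [2, 4, 8, 16, 32]   # total cache size
BLOCK_BYTES   = [16, 32, 64]        # block size
ASSOCIATIVITY = [1, 2, 4]           # number of ways

def tune_cache(evaluate_energy):
    """Return (best_config, best_metric) after evaluating every configuration."""
    best_config, best_metric = None, float("inf")
    for config in product(SIZES_KB, BLOCK_BYTES, ASSOCIATIVITY):
        metric = evaluate_energy(*config)   # e.g., total energy in mJ
        if metric < best_metric:
            best_config, best_metric = config, metric
    return best_config, best_metric

# The cache tuning time grows with the design space: here 5 * 3 * 3 = 45
# configurations, each of which may require a full simulation of the application.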
For systems with architectural support for executing multiple applications concurrently, the cache parameters’ dependency on all of the concurrent applications’ characteristics become intricate, thereby complicating cache tuning. In simultaneousmultithreaded (SMT) processors, fine-grained resource sharing requires all concurrently executing threads to compete for hardware resources, such as the instruction queue, renaming registers, caches, and translation look-aside buffers (TLBs). In chip multiprocessor (CMP) systems, only the last-level caches, memories, and communication subsystems are shared. In general, applications running on a CMP system have less interaction (cache coherence and data dependency) than applications running on an SMT processor. Systems that combine SMT and CMP further exacerbate cache tuning complexity. Even though cache tuning is a well-studied optimization technique, there is no comprehensive cache tuning survey. Venkatachalam and Franz [2005] reviewed power reduction techniques for microprocessor systems, which ranged broadly from circuit-level to application-level techniques and included only a brief discussion of cache tuning. Inoue et al. [2001] surveyed architectural techniques for caches from high-performance and low-power perspectives. Uhlig and Mudge [1997] surveyed cache tuning techniques that leveraged trace-driven cache simulation. That survey elaborated on the three basic trace-driven cache simulation steps: trace collection, trace reduction, and trace processing. We augment that survey with recent advances in trace-driven cache simulation in Section 2.2.1.4. In addition, several technical papers [Dhodapkar and Smith 2003; Meng et al. 2005; Zhang et al. 2004; Zhou et al. 2003] provided thorough overviews on one or more specific cache tuning techniques. Our survey distinguishes itself from previous surveys in that our survey provides both an up-to-date and comprehensive cache tuning review and is, to the best of our knowledge, the first such survey. We present comprehensive summaries, classify existing works, and outline each cache tuning technique’s challenges with the intent of inspiring potential research directions and solutions. Our survey contains several overlapping topics that have been discussed in some previous reviews with respect to ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:4 W. Zang and A. Gorden-Ross computer architecture performance evaluation techniques [Eeckhout 2010], computer architecture techniques for power-efficiency [Kaxiras and Martonosi 2008], and multicore cache hierarchies [Balasubramonian 2011]; however, our survey differs from these reviews in that our survey focuses on cache tuning. Since the cache tuning scope is very broad, we outline the remainder of this survey’s structure based on the following cache tuning classification: design-time offline static cache tuning and run-time online dynamic cache tuning. Offline static cache tuning is discussed in Section 2 with hardware support for core-based processors that tune cache parameters at design time in Section 2.1 and offline cache tuning techniques in Section 2.2. Online dynamic cache tuning is discussed in Section 3. Circuit-level techniques for reducing leakage power in configurable architectures are reviewed in Section 3.1. 
Configurable cache architectures in conjunction with dynamic cache parameter adjusting techniques for single-core architectures are reviewed in Section 3.2, and Section 3.3 summarizes cache tuning techniques for multicore architectures. In phase-based cache tuning, detecting the phase changes, that is, any change in the executing application such that a different cache configuration would be better than the previous cache configuration [Gordon-Ross et al. 2008], is another interesting topic in dynamic cache tuning, thus we discuss online and offline techniques to detect phase changes in Section 3.4. Since various techniques in cache tuning classifications in each section have their own merits, limitations, and challenges, we will elaborate on each of these within each section. Finally, Section 4 reviews architectural-level power/energy quantitative techniques and estimation tools, and Section 5 concludes our survey and summarizes future directions and challenges with respect to cache tuning. 2. OFFLINE STATIC CACHE TUNING Cache tuning performed at design time is classified as offline static cache tuning. In static cache tuning, designers evaluate an application or system during design time to determine the cache requirements and optimal cache configuration. During runtime, the system fixes the cache configuration to the offline-determined optimal cache configuration, and the cache configuration does not change during execution; therefore, there is no runtime overhead with respect to online design space exploration. Since static cache tuning determines cache parameters prior to system runtime, the actual runtime behavior with respect to dynamically changing system inputs and environmental stimuli cannot be captured. Offline static cache tuning is suitable for stable systems with predictable inputs and execution behavior. We discuss core-based processors with hardware support for offline static cache tuning in Section 2.1. In Section 2.2, we review offline static cache tuning techniques using simulators and analytical models. Finally, we summarize the offline static cache tuning techniques in Section 2.3. 2.1. Hardware Support: Core-Based Processors As reusable modules, intellectual property (IP) cores have a profound impact on system design. Core-based processors can provide tuning flexibility, such as configurable cache parameters, to a myriad of designers, enabling all designers to leverage and benefit from a single IP core design. Therefore, configurable caches are readily available to designers in core-based processors. A soft-core processor model consists of a synthesizable HDL (hardware description language) description that allows a designer to choose a particular cache configuration. After a designer customizes the processor model, the soft-core model is synthesized for the final platform. Cache parameters can be specified during design time for many commercial softcore processors. In the MIPS32 4KTM [MIPS32 2001] processor cores, the data and ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. A Survey on Cache Tuning from a Power/Energy Perspective 32:5 instruction cache controllers support caches of various sizes, organizations, and set associativities. In addition to an on-chip variable-sized primary cache, the MIPS R4000 [1994] processor cores have an optional external secondary cache that varies in size from 128 Kbytes to 4 Mbytes. Both caches provide variable cache block size. 
Some ARM processor cores,1 such as the ARM 9 and ARM 11 families, provide configurable caches with total size ranging from 4 Kbytes to 128 Kbytes. ARC configurable cores2 , such as the ARC 625D, ARC 710D, and ARC 750D, enable design-time configurability of instruction and data cache sizes. Tensilica’s Xtensa Configurable Processors3 offer multiple options for cache size and associativity. Alternatively, some hard-core processors [Gordon-Ross et al. 2007; Malik et al. 2000; Zhang et al. 2004] support offline cache tuning by providing runtime configurable cache ways and/or sets (i.e., ways/sets can be selectively enabled/disabled) and a configurable block size. These hard-core processors enable the system to self-configure the cache to the designer-specified cache parameters at system startup. Since these processors can also be leveraged in runtime cache tuning, we elaborate on these processors in Section 3.2. 2.2. Offline Cache Tuning Techniques Numerous techniques exist for offline static cache tuning, including techniques that directly simulate the cache behavior using a simulator (simulation-based cache tuning, Section 2.2.1) and techniques that formulize cache miss rates with mathematical models according to theoretic analysis (analytical modeling, Section 2.2.2). However, all techniques provide the same basic output: the estimated or actual cache miss rates for the cache configurations in the design space. Using these cache miss rates, designers determine the optimal cache configuration based on the desired design metrics, such as the lowest energy cache configuration. 2.2.1. Simulation-Based Cache Tuning. Simulation-based tuning is a common technique not only for tuning the cache, but also for tuning any design parameter in the processor. Cache simulation leverages software to model cache operations and estimates cache miss rates or other metrics based on a representative application input. The simulator allows cache parameters to be varied such that different cache configurations in the design space can be simulated easily without building costly physical hardware prototypes. However, the simulation model can be cumbersome to setup, since accurately modeling a system’s environment can be more difficult than modeling the system itself. Additionally, since cache simulation is generally slow, simulating only seconds of application execution can require hours or days of simulation time for each cache configuration in the design space. Due to this lengthy simulation time, simulation-based cache tuning is generally not feasible for large design spaces and/or complex applications. In this section, we review simulators with respect to three different classification criteria: accuracy-level classification classifies simulators as functional simulation or timing simulation (Section 2.2.1.1); simulation input classification classifies simulators as trace/event-driven simulation or execution-driven simulation (Section 2.2.1.2); and simulation coverage classification classifies simulators as microarchitecture simulators or full-system simulators (Section 2.2.1.3). Microarchitecture and full-system simulators can provide functional accuracy or cycle-level accuracy and use trace/event-driven or execution-driven inputs. Aside from general trace-driven simulation for all modules in the microarchitecture, specialized trace-driven cache simulation (Section 2.2.1.4), 1 http://www.arm.com/products/processors/. 2 http://www.synopsys.com/IP/ConfigurableCores/ARCProcessors/Pages/Default.aspx. 
3 http://tensilica.com/products/xtensa-customizable/configurable.htm. ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:6 W. Zang and A. Gorden-Ross which only simulates the cache module, is widely used for cache tuning. Finally, we review simulator acceleration techniques and cache tuning acceleration using efficient design space exploration in Sections 2.2.1.5 and 2.2.1.6, respectively. 2.2.1.1. Functional and Timing Simulation. Functional Simulation. Functional simulation is also referred to as instruction-set emulation, since functional simulation only provides the functional characteristics of the instruction set architecture (ISA) without timing estimates. Therefore, functional simulation is typically used for functional correctness validation. Cache hits/misses can be generated from the functional simulation; however, since the timing-related execution is not available and only the user-level code is simulated, functional simulation lacks accuracy. The advantage of functional simulation is fast simulation time. Functional simulation can also be used to generate a functionally-accurate memory access trace for trace-driven cache simulation (Section 2.2.1.4). The SimpleScalar Tool Set [Austin et al. 2002] is widely used in academia and industry for single-core processor simulation and contains two microarchitecture functional simulators: sim-safe and sim-fast. SimpleScalar is available in several processor variations, such as the Alpha, PISA, ARM, and x86 instruction sets. Timing Simulation. Timing simulation is typically performed in conjunction with functional simulation. Timing simulation models architecture internals, such as the pipeline, branch predictor, detailed cache and memory access latencies, etc. and outputs timing-related statistics, such as cycles per instruction (CPI). Timing simulation generates more accurate cache hit/miss rates than functional simulation, since the timing-dependent speculative execution, out-of-order execution, and the inter thread/intercore contentions to the shared resources are simulated correctly. More simulator development time and simulation time are required than functional simulation. 2.2.1.2. Trace/Event-Driven and Execution-Driven Simulation. Trace/Event-Driven. Trace/event-driven simulation decomposes the complete simulation into the functional simulation and timing simulation. During trace collection, a functional simulator outputs the application’s instruction or execution-event trace, which serves as input to a detailed timing simulation. In trace/event-driven simulation, the functional simulation is performed only once, and the detailed timing simulation for only the events of interest (i.e., the events related to interesting system aspects that need to be evaluated) is performed many times to detect the correct occurrence time of the traced instructions/events for different microarchitectures (with different configurations). Therefore, the total simulation time for trace/event-driven simulation is much less than that of timing simulation. In addition to using a functional simulator, the traces can be collected with direct application execution by running an instrumented binary on a host machine. Execution on real hardware is much faster than functional simulation; however, the simulation target architecture and ISA must be the same as the host architecture and ISA. Existing instrumentation tools include Atom [Srivastava and Eustace 1994], Shade [Cmelik and Keppel 1994], Pin [Luk et al. 
2005], Graphite [Miller et al. 2010], and CMP$IM [Jaleel et al. 2008a]. Specifically, CMP$IM is developed based on Pin for multicore cache simulation. Although trace/event-driven simulation is intuitively simple, this simulation technique has several challenges. For example, trace/event collection can be difficult for complex systems, such as systems with multiple processors, concurrent tasks, or dynamically-linked/dynamically-compiled code. Since these traces are static and reflect a single execution path, it is challenging for trace/event-driven simulation to ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. A Survey on Cache Tuning from a Power/Energy Perspective 32:7 capture the dynamic behavior of system execution. If trace/event collection is not accurate, trace simulation results may be inaccurate. Additionally, traces are typically large (tens to hundreds of gigabytes), thus requiring significant storage space and lengthy trace processing times. Previous works proposed trace sampling [Conte et al. 1998] and trace compression [Janapsatya et al. 2007] techniques to reduce the trace size without excessive loss to the trace’s completeness and accuracy. Rico et al. [2011] proposed a trace-driven simulation technique for multithreaded applications wherein trace-driven simulation was combined with the dynamic execution of parallelism-management operations to track the dynamic thread scheduling and interthread synchronizations. Lee et al. [2010] developed tsim, which used a two-phase, trace-driven simulation approach for fast multicore architecture simulation. To simulate an out-of-order processor, the timing error overhead was estimated during trace collection in order to correct the timing errors in trace simulation. The authors also proposed directly evaluating the trace changes along with the performance (e.g., cache misses) during trace processing. To simulate the synchronizations, synchronization primitives were recorded in the trace files to model resource contentions. Execution-Driven Simulation. Execution-driven simulation tightly integrates the functional and timing simulations. In execution-driven simulation, a simulator executes the application binary and additionally simulates the architecture’s behavior. Unlike trace-driven simulation with large, static trace files, execution-driven simulation simulates dynamically-changing application execution, such as speculated executions, out-of-order executions, multithreaded and multicore accesses to shared caches, and different execution paths based on different inputs. The trade-off for these additional details as compared to trace-driven simulation is significantly longer simulation time. Execution-driven simulators are the most prevalent simulation tool used in previous works. Many execution-driven simulators have been designed, including SimpleScalar, M5 [Binkert et al. 2006], GEMS [Martin et al. 2005], Gem54 , SESC [Ortego and Sack 2004], RSIM [Hughes et al. 2002], Flexus [Wenisch et al. 2006], PTLSim [Yourst, 2007], etc. The most widely used execution-driven simulator for single processors is SimpleScalar, in which sim-cache (which is specially designed for cache simulation) provides many cache organization options. Gem5 (SE-mode), which is the combination of M5 and GEMS, supports execution-driven simulation for multicore microarchitectures. The default cache coherence is an abstract snooping protocol. 
Even though the cache hierarchy is easily configured and the coherence protocol can automatically extend to any memory hierarchy, the cache coherence protocol is difficult to modify. SESC is a simulator used to simulate a superscalar processor. The processor core uses execution-driven simulation, while the remainder of the architecture uses event-driven simulation, and events are called and scheduled as needed. Although SESC can simulate CMP architectures, the multicore operations are actually emulated using multiple threads. Therefore, SESC cannot simulate architectures that combine SMT and CMP. Currently, SESC only supports the MIPS ISA. 2.2.1.3. Microarchitecture and Full-System Simulation. Microarchitecture Simulation. In general, microarchitecture simulators only simulate the user-level code and do not simulate the operating system (OS)-level code. However, simulating only user-level code is not sufficient when the OS code has significant impacts on the execution (e.g., system calls, interrupts, input/output (I/O) events, etc.). Some microarchitecture simulators use a proxy call mechanism by invoking the 4 The Gem5 Simulater. http://gem5.org. ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:8 W. Zang and A. Gorden-Ross host OS to emulate the changes in register state and memory state due to system calls; however, this OS emulation lacks flexibility and is not as accurate as direct OS-level code simulation. Full-System Simulation. Full-system simulators provide the highest fidelity to the actual system, as compared to all other simulators, due to the full-system simulator’s large-scale simulation coverage, which includes both user- and OS-level code and I/O devices, such as Ethernet and disks. OS code simulation is especially critical for multithreaded and multicore architectures, since OS scheduling directly affects these architectures’ execution, and thus affects cache behavior. Full-system simulation acts as a system virtual machine that includes all of the I/O, OS, memory, and other device activities. Therefore, full-system simulation covers more features of the target architecture than microarchitecture simulation but requires extremely lengthy simulation time. SimOS [Rosenblum et al. 1997] is a full-system simulator developed by Stanford University. SimOS uses a simple cache-coherent, nonuniform memory access (CCNUMA) memory system with a directory-based cache-coherence policy. Virtutech Simics [Magnusson et al. 2002] is a well-maintained, commercially available full-system functional simulator that supports many instruction sets, such as Alpha, PowerPC, x86, AMD64, MIPS, ARM, and SPARC, and device models, such as graphics cards, Ethernet cards, and PCI (peripheral component interconnect) controllers. Users can customize these modules to simulate the target processor’s details. Simics is very flexible and supports heterogeneous platforms with various processor architectures and operating systems. However, Simics does not provide cycle-accurate simulation, and not all of the source code is readily available. Another widely used full-system simulator is Gem5 (FS-mode). Gem5 provides cycle-level precise evaluation of the memory hierarchy and interconnection, and supports the same architecture families as Simics. Gem5 is flexible and convenient for cache-based research, since SLICC (Specification Language for Implementing Cache Coherence) allows users to customize cache coherence protocols from directory- to snooping-based. 
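The proxy call mechanism can be illustrated with a deliberately small sketch: when the simulated (guest) program raises a system call, the simulator services it on the host OS and writes the result back into the guest's register and memory state. The register names, syscall number, and handler below are hypothetical and are not drawn from any particular simulator.

# Toy illustration of proxy system-call emulation in a user-level
# (microarchitecture) simulator: the guest's "write" syscall is serviced
# by the host OS, and the result is reflected back into guest state.
import sys

class GuestState:
    def __init__(self):
        self.regs = {"a0": 0, "a1": 0, "a2": 0, "v0": 0}  # hypothetical register names
        self.memory = bytearray(64 * 1024)                # flat guest memory

SYS_WRITE = 4  # hypothetical syscall number

def proxy_syscall(guest, syscall_no):
    """Service a guest syscall on the host and update the guest registers."""
    if syscall_no == SYS_WRITE:
        fd, buf, count = guest.regs["a0"], guest.regs["a1"], guest.regs["a2"]
        data = bytes(guest.memory[buf:buf + count])
        written = sys.stdout.write(data.decode(errors="replace")) if fd == 1 else 0
        guest.regs["v0"] = written         # return value visible to the guest
    else:
        raise NotImplementedError(f"unhandled syscall {syscall_no}")

guest = GuestState()
msg = b"hello from the guest\n"
guest.memory[0x100:0x100 + len(msg)] = msg
guest.regs.update(a0=1, a1=0x100, a2=len(msg))
proxy_syscall(guest, SYS_WRITE)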
However, simulation that includes a SLICC-defined memory system increases the simulation time, and extending the SLICC protocol to other cache hierarchies (i.e., the protocol is only employed for a specific cache hierarchy) is difficult. Other full-system simulators include M5, QEMU [Bellard 2005], Embra [Witchell and Rosenblum 1996], AMD’s SimNow [Bedichek 2004], and Bochs [Mihocka and Schwartsman 2008]. 2.2.1.4. Specialized Trace-Driven Cache Simulation. For specialized trace-driven cache simulation, cache behavior can be simulated using a sequence of time-ordered memory accesses, typically referred to as an access trace. The access trace can be collected using any of the aforementioned simulators depending on the desired accuracy, and therefore, the access trace may contain only user-level addresses or both user-level and system-level addresses. Unlike general trace-driven simulation for a microarchitecture, trace-driven cache simulation only simulates the cache module. Since simulating the entire processor/system is slow, trace-driven cache simulation can significantly reduce cache tuning time by simulating an application only once to produce the access trace and then process that access trace quickly to evaluate the cache configurations. Trace-driven cache simulation shares the same limitations as general trace-driven simulation. For example, speculative execution is difficult to simulate accurately, since trace collection typically does not collect the wrong-path execution, and since the access trace is statically generated for fixed inputs, the evaluated cache performance may not apply to other input sets. In addition, modeling dynamic execution in real systems is challenging, since the cache access trace is fixed with a particular timing order. For in-order single-processor simulation, the memory access trace does not change as the cache architecture changes. Therefore, there is no timing dependency ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. A Survey on Cache Tuning from a Power/Energy Perspective 32:9 for trace-driven cache simulation results. For out-of-order processors with dynamically scheduled instructions, the memory access trace changes based on the different cache misses and cache/memory access latencies. In previous works, Lee et al. [2009] and Lee and Cho [2011] predicted the impact of different cache configurations on the cache misses and approximated superscalar processor performance from in-order generated traces. In a multithreaded/multicore architecture, the trace is typically timing- and cache architecture-dependent. Goldschmidt and Hennessey [1992] investigated these dynamic effects in multithreaded and multicore simulations. For shared caches, the interleaved access order from different threads/cores changed across different cache configurations due to the changes in the cache miss addresses and related latencies for each individual thread. Additional factors, such as dynamic scheduling, barrier synchronization, and cache coherence, may also produce different timing-dependent traces for different cache configurations. In developing trace-driven cache simulation for multithreaded/multicore architectures, all of these timing dependencies must be evaluated. Even though the cache hit/miss rates are simulated correctly using trace-driven cache simulation, the overall system energy consumption is difficult to evaluate. 
When evaluating the energy consumption, other factors should be considered, such as techniques used to hide cache miss latencies and the contention for other critical shared resources in multithreaded/multicore architectures, which are not available in trace-driven cache simulation.

Sequential Trace-Driven Cache Simulation. A naïve technique for leveraging trace-driven cache simulation for cache tuning is to sequentially process the access trace for each cache configuration, where the number of trace processing passes is equal to the number of cache configurations in the design space. This sequential trace-driven cache simulation technique can result in prohibitively lengthy simulation time for a large access trace, thus requiring lengthy tuning time for exhaustive design space exploration.

Dinero [Edler and Hill 1998] is the most widely used trace simulator for single-processor systems. Dinero models each cache set as a linked list, where the number of nodes in each list is equal to the cache set associativity and each node records the tag information for the addresses that map to that set. Cache operations are modeled as linked list manipulations. For each trace address, Dinero computes the set index and tag according to the cache configuration's parameters and then checks the corresponding linked list to determine if that access is a hit or a miss. Dinero then updates the set's linked list in accordance with the replacement policy.

CASPER [Iyer 2003] provides trace-driven cache simulation for both single- and multicore architectures. The multicore cache uses the MESI coherence protocol (the most common protocol, in which every cache line is either in the modified, exclusive, shared, or invalid state) along with several variations, and the cache hierarchies employ last-level shared caches. Cache parameters are configurable for each cache level and heterogeneous configurations are allowed, but the cores' private cache configurations must be homogeneous. CMP$IM is a trace-driven multicore cache simulator that processes the trace addresses on-the-fly, thus eliminating the overhead of storing large trace files. CMP$IM employs an invalidation-based cache coherence protocol. Users can specify the cache parameters, replacement policy, write policy, number of cache levels, and the levels' inclusion policy. Similarly to CASPER, CMP$IM requires homogeneous private caches.
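To ground the per-set bookkeeping described above, the following sketch processes an address trace against a single configuration, modeling each set as an LRU-ordered list of tags. It is a minimal illustration in the spirit of Dinero-style sequential trace-driven cache simulation, not Dinero's actual implementation, and it would be rerun once per configuration in the design space.

# Minimal sequential trace-driven cache simulation for one configuration
# (a sketch in the spirit of Dinero-style simulators, not their actual code).
def simulate(trace, size_bytes, block_bytes, assoc):
    num_sets = size_bytes // (block_bytes * assoc)
    sets = [[] for _ in range(num_sets)]   # per-set list of tags, MRU first
    hits = misses = 0
    for addr in trace:
        block = addr // block_bytes        # strip the block offset
        index, tag = block % num_sets, block // num_sets
        lru = sets[index]
        if tag in lru:                     # hit: move the tag to the MRU position
            hits += 1
            lru.remove(tag)
        else:                              # miss: evict the LRU tag if the set is full
            misses += 1
            if len(lru) == assoc:
                lru.pop()
        lru.insert(0, tag)
    return hits, misses

# Sequential cache tuning repeats this pass once for every configuration:
trace = [0x1000, 0x1004, 0x2000, 0x1000, 0x3000, 0x1004]
print(simulate(trace, size_bytes=1024, block_bytes=32, assoc=2))  # (3, 3)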
Single-Pass Trace-Driven Cache Simulation. To speed up the total trace processing time as compared to sequential trace-driven cache simulation, single-pass trace-driven cache simulation simulates multiple cache configurations during a single trace-processing pass, thereby reducing the cache tuning time. The trace simulator processes the trace addresses sequentially and evaluates each processed address to determine if the access results in a cache hit/miss for all cache configurations simultaneously. For each processed address, the number of conflicts (i.e., previously accessed blocks that map to the same cache set as the currently processed address) directly determines whether or not the processed address is a cache hit/miss. Evaluating the number of conflicts is very time consuming, because all previously accessed unique addresses must be evaluated for conflicts. Single-pass trace-driven cache simulation leverages two cache properties, the inclusion property and the set refinement property, to speed up this conflict evaluation. The inclusion property [Mattson et al. 1970] states that larger caches contain a superset of the blocks present in smaller caches. If two cache configurations have the same block size, the same number of sets, and use access order-based replacement policies (e.g., least recently used (LRU)), the inclusion property indicates that the conflicts for the cache configuration with a smaller associativity also form the conflicts for the cache configuration with a larger associativity. The set refinement property [Hill and Smith 1989] states that the blocks that map to the same cache set in larger caches also map to the same cache set in smaller caches. The set refinement property implies that the conflicts for the cache configuration with a larger number of sets are also the conflicts for the cache configuration with a smaller number of sets if both cache configurations have the same block size.

Single-pass trace-driven cache simulation can be divided into two categories based on the algorithm and data structure used: the stack-based algorithm and the tree/forest-based algorithm.

[Fig. 2. An example of the tree/forest structure for trace-driven cache simulation using the tree/forest-based algorithm. The rectangles correspond to tree nodes, and values in the rectangles correspond to the indexes of the cache sets.]

Early work by Mattson et al. [1970] developed the stack-based algorithm for fully-associative caches, which served as the foundation for all future trace-driven cache simulation work. Figure 1 depicts two address-processing cases in trace-driven cache simulation using the stack-based algorithm.

[Fig. 1. Two address-processing cases in trace-driven cache simulation using the stack-based algorithm: case 1 depicts the situation where the processed address is not found in the stack, and case 2 depicts the situation where the processed address is found in the stack.]

Letters A, B, C, D, E, and F represent different addresses that map to different cache blocks. The trace simulator processes the access trace one address at a time. For each processed address, the algorithm first performs a stack search for a previous access to the currently processed address. Case 1 depicts the situation where the currently processed address has not been accessed before and is not present in any previously accessed cache block (the address is not present in the stack); therefore, this access is a compulsory cache miss. Case 2 depicts the situation where the currently processed address is located in the stack and has been previously accessed. Conflict evaluation evaluates the potential conflict set (all addresses in the stack between the stack's top and the currently processed address's previous access location in the stack) to determine the conflicts. The number of conflicts directly determines the minimum associativity (i.e., cache size for a fully-associative cache) necessary to result in a cache hit for the currently processed address.
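A minimal sketch of this stack (reuse distance) processing for a fully-associative LRU cache is shown below; a single pass over the trace yields hit counts for every capacity in the design space simultaneously, because an access hits in any cache whose capacity exceeds its number of conflicts. The stack-update step described in the next paragraph is folded into the same loop, and the code treats each address as one block for simplicity.

# Single-pass stack-based simulation (a sketch in the style of Mattson et al.)
# for fully-associative LRU caches: one trace pass yields hit counts for all
# capacities at once.  Illustrative only; each address is treated as one block.
from collections import Counter

def stack_distances(trace):
    stack, dists = [], Counter()       # stack[0] is the most recently used block
    compulsory_misses = 0
    for addr in trace:
        if addr in stack:              # stack search: found -> conflict evaluation
            depth = stack.index(addr)  # number of conflicts above the previous access
            dists[depth] += 1
            stack.remove(addr)         # stack update: remove the previous access ...
        else:                          # not found -> compulsory miss
            compulsory_misses += 1
        stack.insert(0, addr)          # ... and push the address onto the stack top
    return dists, compulsory_misses

def hits_for_capacity(dists, capacity_blocks):
    """An access hits whenever its number of conflicts is below the capacity."""
    return sum(n for depth, n in dists.items() if depth < capacity_blocks)

dists, cold = stack_distances([1, 2, 3, 1, 2, 4, 1])
for blocks in (1, 2, 4, 8):            # evaluate several cache capacities from one pass
    print(blocks, hits_for_capacity(dists, blocks))

These per-access conflict counts are the same reuse distances that the analytical models in Section 2.2.2.2 build on.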
After conflict evaluation, the stack update process updates the stack’s stored address access order by pushing the currently processed address onto the stack and removing the previous access to the currently processed address if the current access was not a compulsory miss. Hill and Smith [1989] extended the stack-based algorithm to simulate direct-mapped and set-associative caches and leveraged the set refinement property to efficiently determine the number of conflicts for caches with different numbers of sets. Thompson and Smith [1989] introduced dirty-level analysis and included write-back counters for write-back caches. Since the time complexity of both the stack search for the processed address and the conflict evaluation is on the order of the stack size (which in the worst case, is equal to the number of uniquely accessed addresses), the stack search and conflict evaluation can be very time consuming. To speed up the trace processing time, much work has focused on replacing the stack structure with a tree/forest structure that stores and traverses accesses more efficiently than the stack structure by storing the cache contents for multiple cache configurations in a single structure. The commonly used tree/forest structure in trace-driven cache simulation uses different tree levels to represent different cache configurations with a different number of cache sets. Figure 2 depicts an example of this tree/forest structure. In this example, the number of cache sets can be configured as 2, 4, and 8, which requires two three-level trees. Each rectangle in the figure represents a single node, where each node stores each cache set’s contents, and the rectangle’s value corresponds to the cache set’s index address. In this manner, the number of addresses stored in a node directly indicates the number of conflicts. When processing a new address, the processed address is searched for in the tree and added to the tree nodes based on the set that the processed address would map to in each level. Since the tree/forest structure in Figure 2 can only simulate cache configurations with a fixed block size, multiple trees/forests are typically used to simulate multiple block sizes in a single pass [Hill and Smith 1989; Janapsatya et al. 2006]. One drawback of the tree/forest structure is redundant storage overhead, because each unique block address must be stored in one node in each level. Sugumar and Abraham [1991] developed two structures/algorithms to simulate the cache configurations with either a fixed block size or a fixed cache size while varying the remaining two parameters. Their novel structures reduce the storage by a factor of two, as compared to previous tree/forest structures. Tree/forest-based algorithms have several disadvantages compared to stack-based algorithms, such as increased complexity that requires complicated processing operations, which make tree/forest-based algorithms not easily amenable to hardware implementation for runtime cache tuning. Therefore, the stack-based algorithm is still widely used. Viana et al. [2008] proposed SPCE—a stack-based algorithm that evaluated cache size, block size, and associativity simultaneously using simple bit-wise operations. Gordon-Ross et al. [2007] designed SPCE’s hardware prototype for runtime cache tuning and recently extended the stack-based algorithm for exclusive two-level cache simulation [Zang and Gordon-Ross 2011]. ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:12 W. Zang and A. 
Gorden-Ross Although single-pass trace-driven cache simulation is a prominent solution for reducing tuning time, several drawbacks limit the single-pass trace-driven cache simulation’s versatility. Both the stack- and tree/forest-based algorithms restrict the replacement policy to accessing order-based replacement policies. Additionally, some single-pass algorithms that simulate caches with only one or two variable parameters must reexecute the algorithm several times in order to cover the entire design space. Furthermore, single-pass trace-driven cache simulation for multilevel caches is complex. For example, in a two-level cache hierarchy, the level-one cache filters the access trace and produces one filtered access trace for each level-two cache (i.e., each unique levelone cache configuration’s misses form a unique filtered access trace for the level-two cache). When directly leveraging single-pass trace-driven cache simulation for singlelevel caches to simulate each of the two-level caches, the two-level cache simulation must store and process each filtered access trace separately to combine each leveltwo cache configuration with each level-one cache configuration. Since the number of filtered access traces is dictated by level-one cache’s design space size, the time and storage complexities can be considerably large for multilevel cache simulation. Finally, single-pass trace-driven cache simulation does not support multithreaded/multicore architectures. 2.2.1.5. Simulation Acceleration. As today’s processors increase in complexity and functionality, the simulation time required for detailed cycle-accurate simulation is becoming infeasible, especially for multithreaded/multicore microarchitectures, fullsystems, and highly configurable caches. Therefore, multiple techniques are proposed to accelerate the simulation time. Instead of simulating the entire application, sampled simulation simulates only a small fraction of the application—the critical regions. Intervals sampled from the critical regions are simulated to represent the entire application’s execution characteristics. Challenges include determining the interval length and sampling frequency. The interval length should be long enough to capture the spatial and temporal localities of the cache accesses, and the sampling frequency must be fast enough to capture all relevant execution changes; however, long intervals and frequent sampling complicate simulation and increase simulation time. Previous works suggest three techniques for guiding interval length and frequency sampling selection. Random sampling [Laha et al. 1988] is the easiest technique and evaluates cache performance at a random sampling frequency with a fixed interval length, but may provide an inaccurate view of an application’s entire execution behavior. Periodic sampling, such as SMARTS (Sampling Microarchitecture Simulation) [Wunderlich et al. 2003], selects the simulation intervals using a fixed period across the entire execution, and the loss in performance accuracy is quantified using statistical theory. Finally, the most complicated but accurate technique is to select a representative interval for each execution phase and then estimate the overall performance using weighted accumulation. SimPoint [Hamerly et al. 2005] is the most well-known tool for selecting such representative intervals (Section 3.4.2.) Another acceleration technique is statistical simulation [Eeckhout et al. 2003; Genbrugge et al. 2006; Joshi et al. 2006], which consists of three steps. 
The first step is statistical profiling, which collects statistics about key application characteristics, such as the statistical instruction types, control flow graph, branch predictability, and cache behavior. The second step is synthetic trace generation, which produces a short synthetic trace that includes all of the statistical characteristics. The last step is statistical processor modeling, which simulates the synthetically generated trace. The simulator model is simple, such that the cache is not explicitly modeled, and instead, the simulator simulates the miss correlation and models the miss latency as delayed hits. Due to ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. A Survey on Cache Tuning from a Power/Energy Perspective 32:13 the short synthetic trace and the simplified simulator, there is a significant speedup in simulation time. Finally, due to the high computing capability of multicore processors and machine clusters, parallel computing provides a straightforward technique for speeding up simulation-based cache tuning. Since each cache configuration can be evaluated independently, the design space can be distributed across multiple processors, and each processor simulates a different cache configuration simultaneously. However, this straightforward distribution does not speed up a single cache configuration’s simulation time, which can range from hours to days or weeks for complex systems and/or lengthy applications. Therefore, much research focuses on more sophisticated parallel simulators [Chidester and George 2002; Falcón et al. 2008; Chen et al. 2009], which essentially partition the simulation work across multiple host threads, and the threads are synchronized in each/multiple simulated cycle(s) for cycle-accurate simulators. Specifically for trace-driven cache simulation, Heidelberger and Stone [1990] partitioned the access trace into non-overlapping segments for parallel simulation. Sugumar [1993] decreased the stack processing time using a parallel stack search technique. Wan et al. [2009] developed a GPU (graphics processing unit)-based trace simulator in which different cache configurations were simulated in different threads in parallel. Additionally, for each cache configuration simulation, the simulator maintained different cache sets in parallel and searched the matched tag for the processed address in the cache set with a large amount of cache ways in parallel. 2.2.1.6. Cache Tuning Acceleration Using Efficient Design Space Exploration. Assuming that the cache performance evaluated using simulation is correct, an exhaustive search always determines the optimal cache configuration, because this technique exhaustively evaluates all configurations in the design space. However, since design spaces are typically large, the design exploration time (i.e., cache tuning time) for exhaustive search is prohibitively long. Viana et al. [2006] showed that if intelligently selected, a small subset of cache configurations would yield effective cache tuning, resulting in a fraction of the exhaustive design exploration time. Therefore, efficient design space pruning techniques are critical for simulation-based cache tuning and trade off accuracy for reduced tuning time by searching only a subset of the design space. Even though these techniques do not guarantee optimality, careful design can yield near-optimal results. 
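Because each configuration's simulation is independent, the coarse-grained parallelism described above requires little more than a worker pool. The sketch below distributes a design space across host processes; simulate_config() is a placeholder for any of the simulators discussed in this section, and the dummy result it returns is purely illustrative.

# Distributing design space exploration across host cores: each worker process
# simulates a different cache configuration (sketch; simulate_config() is a
# placeholder for an actual simulator invocation).
from itertools import product
from multiprocessing import Pool

def simulate_config(config):
    size_kb, block_bytes, ways = config
    # ... run the (possibly hours-long) simulation for this configuration ...
    miss_rate = 0.05 / (size_kb * ways)      # dummy stand-in for the simulator's result
    return config, miss_rate

if __name__ == "__main__":
    design_space = list(product([2, 4, 8, 16], [16, 32, 64], [1, 2, 4]))
    with Pool() as pool:                     # one configuration per worker at a time
        results = pool.map(simulate_config, design_space)
    print("lowest miss rate:", min(results, key=lambda r: r[1]))

As noted above, this shortens the exploration of the design space but does not shorten any single configuration's simulation.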
Heuristic search techniques, such as Zhang et al.’s [2004] single-level cache tuning and Gordon-Ross et al.’s [2004, 2009] two-level and unified cache tuning, developed for online dynamic cache tuning (Section 3.2.3), can be directly used for offline simulationbased cache tuning to prune uninteresting cache configurations from the design space and significantly speed up design exploration. As an optimization problem, cache tuning can use genetic algorithms to accelerate design space exploration. Since the optimal cache configuration may be different for each execution phase, offline exhaustive exploration of the design space for each phase is infeasible. To address this problem, Dı́az et al. [2009] divided the execution into intervals with a fixed number of instructions, and determined the optimal cache configuration for each interval using a genetic algorithm. The genetic algorithm modeled individuals in the population as a sequence of configurations for all the consecutive intervals, and the evaluation metric was the overall instructions per cycle (IPC). Given a basic set of configuration sequences, the genetic algorithm obtained new configuration sequences as the IPC improved. During runtime, the offline-determined optimal configuration sequence was directly used for all of the intervals. 2.2.2. Analytical Modeling-Based Cache Tuning. Unlike cache simulation techniques that require lengthy simulation/tuning time, analytical modeling directly calculates cache ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:14 W. Zang and A. Gorden-Ross misses for each cache configuration using mathematical models. Since the cache miss rates are calculated nearly instantaneously using computational formulas, the cache tuning time is significantly reduced. Analytical models require detailed statistical information and/or information on critical application events, which can be collected using profilers, such as Cprof [Lebeck and Wood 1994], Ammons et al. [1997], ProfileMe [Dean et al. 1997], and DCPI [Anderson et al. 1997]. Analytical models oftentimes use complex algorithms that can be difficult to solve [Chatterjee et al. 2001; Ghosh et al. 1999]. Some analytical models reduce the complexity by introducing empirical parameters inferred from regression [Ding and Zhong 2003], and some analytical models can only produce statistically accurate cache miss rates by assuming the accessed cache blocks map to all cache sets uniformly [Brehob and Enbody 1996; Zhong et al. 2003], which violates actual cache behavior. Other analytical models are derived from the information gathered by a detailed compiler framework [Chatterjee et al. 2001; Ghosh et al. 1999; Vera et al. 2004], which may result in inaccurate miss rates due to limited compiler information and unpredictable hardware. Therefore, analytical modeling-based cache tuning is typically less accurate than simulation-based cache tuning. Previous works mostly focus on two distinct analytical modeling categories based on either application structures or access traces. 2.2.2.1. Analytical Modeling Based on Application Structures. Since an application’s spatial and temporal locality characteristics, which are mainly dictated by loops, determine cache behavior, cache misses can be estimated based on application structures gathered from specially designed loop profilers. 
However, since loop characteristics do not sufficiently predict exact cache behavior, which also depends on data layout and traversal order, estimating an application’s total cache behavior based only on an application’s loop behavior may be inaccurate. Ghosh et al. [1999] generated a set of cache miss equations that summarized an individual loop’s memory access behavior, which provided precise cache miss rate analysis independent of the data layout. However, directly calculating the cache miss rate from these equations was NP-complete. Vera et al. [2004] proposed a fast and accurate approach to estimate cache miss equation solutions using sampling techniques to approximate the absolute cache miss rate for each memory access. Since both of these models were restricted to perfectly nested loops with no conditional expressions, these models had limited applicability. Chatterjee et al. [2001] extended the model to include nonlinear array layouts using a set of Presburger formulas to characterize and count cache misses. Harper et al. [1999] presented an analytical model that approximated the cache miss rate for any loop structure. Vivekanandarajah et al. [2006] determined near optimal cache sizes using a loop profiling-based technique for loop statistic extraction. Cascaval et al. [2003] estimated the number of cache misses based on stack algorithms. A stack distance histogram was collected using data dependence information, which was obtained from the array accesses in the loop nests. 2.2.2.2. Analytical Modeling Based on Memory Access Traces. Analytical modeling based on a memory access trace is similar to trace-driven cache simulation, by using a functional simulation to produce the memory access trace; however, instead of simulating cache behavior as in trace-driven cache simulation, mathematical equations statistically or empirically analyze the memory access trace to determine the cache miss rates. Most analytical modeling based on a memory access trace leverages the reuse distance between successive accesses to the same memory address. The reuse distance is the number of unique memory accesses (trace address or block address) between these two successive accesses [Brehob and Enbody 1996; Ding and Zhong 2003] and ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. A Survey on Cache Tuning from a Power/Energy Perspective 32:15 is essentially equal to the number of conflicts determined by stack-based trace-driven cache simulation for a fully-associative cache [Mattson et al. 1970]. The reuse distance captures the temporal distance between successive accesses and the spatial locality when using individual cache block addresses as the unique memory access for determining the reuse distance. Brehob and Enbody [1996] leveraged the reuse distance to calculate cache hit probabilities using a uniform distribution assumption for memory accesses. Zhong et al. [2003] analyzed the variation in reuse distances across different data input sets to predict fully-associative cache miss rates for any input. Fang et al. [2004] predicted per-instruction locality and cache miss rates using reuse distance analysis. Ghosh and Givargis [2004] used an analytical model based on access trace information to efficiently explore cache size and associativity and directly determine a cache configuration to meet an application’s performance constraints. For CMP architectures, Chandra et al. 
[2005] proposed a model using access traces of isolated threads to predict interthread contention for shared cache. Reuse distance profiles were analyzed to predict the extra cache misses for each thread due to cache sharing, but the model did not consider the interaction between CPI variations and cache contention. Chen and Aamodt’s [2009] model predicted fine-grained cache contention, but only considered the CPI variations caused by cache contention, and there was only one CPI correction iteration. The author’s model used the CPI acquired in the absence of contention to predict the shared cache misses, and then the cache misses were fed back to correct the CPI and provide a more accurate estimate. Eklov et al. [2011] proposed a simpler model that calculated the CPI considering the cache misses caused by cache contention. The authors calculated the CPI as a function of the cache miss rates and then solved for the cache miss rates using this function. Similarly, Xu et al. [2010] predicted the cache miss rates for a CMP’s shared cache using the per-thread reuse distance. Instead of using simulation, the reuse distance was acquired using hardware performance counters by running the application on a real processor. Xiang et al. [2011] developed a low-cost profiling technique for full-execution analysis with a guaranteed precision. The shared cache miss rate was similarly modeled. All of these analytical models for shared cache miss rates assumed the applications running on different threads/cores were independent with no data sharing, and only shared cache contention among concurrent threads were considered. Ding and Chilimbi’s [2009] locality model for multithreaded applications considered both thread interleaving and data sharing. Shi et al. [2009] analyzed a CMP’s private and shared caches, and data replication in the private caches was modeled. To simulate real-time interactions among multiple cores, the authors developed a single-pass trace-driven stack simulation, where a shared stack and per-core private stacks were maintained to collect reuse distances to calculate the CMP private and shared cache misses with various degrees of data replication. In addition to the analytical model built on statistical characteristics, empirical models were developed for multicore cache miss rate calculation, such as that at Gluhovsky and O’Krafka [2005] for Sun microarchitectures. All of these analytical models are limited to in-order processors. For out-of-order processors, modeling thread interleaving becomes highly complex. 2.3. Summary of Offline Static Cache Tuning In offline static cache tuning, designers determine the optimal cache configuration to meet an application’s specific design metrics before system execution. Table I summarizes the offline static cache tuning techniques and associated attributes reviewed in this section. The first major column lists the cache tuning techniques, where some major techniques are subdivided into different variations. The second major column summarizes the main features of each technique. The third major column compares the ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:16 W. Zang and A. Gorden-Ross Table I. Summary of Offline Static Cache Tuning Techniques Techniques Functional simulation Timing simulation Simulationbased cache tuning Analytical modelingbased cache tuning Features Only provides functionally correct simulation. Includes simulation for timing related execution. Comparison 1. 
Timing simulation is more accurate. 2. Functional simulation is faster than timing simulation (25-338X speedup [Lee et al. 2010]). Trace/event-driven simulation Uses a functional 1. For trace/event-driven simulator to simulation, trace/event generate traces. collection is challenging. Trace is input into a 2. In trace/event-driven timing simulator. simulation, capturing the Many simulation dynamic behavior of iterations with multithreaded/multicore different applications and speculative configurations. execution paths is challenging and input sets are fixed, thus Execution-driven simulation Integrates functional trace/event-driven simulation is and timing generally less accurate than simulations. execution-driven simulation. Application binary 3. Trace/event-driven simulation executes on a is faster than execution-driven simulator. simulation (125-171X speedup [Lee et al. 2010]). Microarchitecture simulation Widely used. 1. Full-system simulation is more Limited to user-level accurate than microarchitecture code simulation. simulation since the OS calls, Emulates OS calls. OS scheduling, and I/O events can be simulated precisely. Full-system simulation Simulates both user- 2. Microarchitecture simulation is and OS-level code. faster than full-system Includes system device simulation (182-338X speedup models. [Lee et al. 2010]). Acts as a virtual machine. Specialized Sequential Simulates memory 1. Single-pass simulation traceaccess trace for each simulates multiple driven configuration. configurations simultaneously, cache Simulating only the which is faster than simulation cache module is sequentially simulating each faster than configuration (8-15X speedup simulating the [Zang and Gordon-Ross 2011]). entire system (100-1000X speedup [Jaleel et al. 2008a]). Single-pass Trace is fixed. Cannot 2. Single-pass trace-driven cache simulate dynamic simulation is confined to a changes in a real limited cache organization system. Not (single/two-level sufficiently accurate. single-processor caches) and access order-based replacement policies. Based on Calculates cache 1. Fast. application hit/miss rates using 2. Complex algorithm. structures mathematical 3. Cache miss rates may be models. inaccurate. 4. Imposes some assumptions on Based on memory application code and/or access trace hardware characteristics. ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. A Survey on Cache Tuning from a Power/Energy Perspective 32:17 techniques based on simulation/evaluation accuracy and speed aspects. The accuracy in this table measures how closely the cache performance evaluated by the simulator or the analytical model is to real hardware execution. In simulation time comparison, we include some quantified simulation speedups collected directly from previous works. These numbers provide only general comparisons, since the simulation speedups may change considerably for different simulators, benchmarks, system settings, and design spaces. In general, when comparing simulators, more accurate and higher coverage simulators are more complicated to design and execute and thus have slower simulation time. We also discussed cache tuning acceleration techniques, which included simulation and design space exploration acceleration. These two acceleration techniques are orthogonal and can be employed concurrently. 
To further increase the speedup, all of the reviewed simulation acceleration techniques (including sampled simulation, statistical simulation, and parallel simulation) can be applied together. However, the acceleration is obtained at the cost of sacrificing accuracy, resulting in potentially suboptimal cache configurations. In analytical modeling-based cache tuning, the model’s inputs generally consist of the events of interest, detailed statistical information gathered during execution, or the code and data flow structures collected using profilers and/or compilers. The profiler can be built as part of the simulator or as real hardware to collect detailed information during the application’s simulation or execution. However, statically gathered profile information may be useless if the profiling information changes based on the input sets, data layouts, and microarchitectures. Compiler information can only consider the code layout characteristics, and even though it is difficult for a compiler to thoroughly analyze data, compilers provide a basic view of static control and data flow and can easily modify the code, which can provide a basic analysis of instruction cache performance and phase changes. Several offline static cache tuning characteristics and techniques are amenable to online dynamic cache tuning, such as efficient heuristic techniques to prune the design space, hardware implementation of the single-pass trace-driven cache simulation techniques, and analytical modeling based on memory access traces. In the next section, we elaborate on online cache tuning techniques, including how these offline static cache tuning characteristics and techniques can be leveraged during online dynamic cache tuning. 3. ONLINE DYNAMIC CACHE TUNING Online dynamic cache tuning alleviates many of the drawbacks of offline static cache tuning by adjusting cache parameters during runtime. Since online cache tuning executes entirely during runtime, designers are not burdened with setting up realistic simulation environments or modeling accurate input stimuli, and thus all design-time efforts are eliminated. Additionally, since online cache tuning can dynamically react to changing execution, the cache configuration can also change to accommodate new execution behavior or application updates. Online dynamic cache tuning is appropriate for multicore architectures, since offline prediction of coscheduled applications is difficult due to variations in the applications’ execution times with respect to different cache configurations. Additionally, an exponential increase in the number of combinations of coscheduled applications makes offline cache tuning for all possible combinations infeasible. However, online dynamic cache tuning requires some hardware overhead to monitor cache behavior and control cache parameter changes, which introduces additional power, energy, and/or performance overheads during cache tuning and can diminish the total savings. Thus, online cache tuning time is more critical than offline cache ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:18 W. Zang and A. Gorden-Ross tuning time. Online cache tuning must determine the optimal cache configuration fast enough to react to execution changes, and the configurable cache architecture must have a lightweight mechanism for dynamically changing the cache parameters that incurs minimal overheads without affecting system performance. 
Whereas a soft-core configurable cache (Section 2.1) can be custom-synthesized for a particular cache configuration, that configuration remains fixed for the entire lifetime of the system. Alternatively, a hard-core configurable cache allows the cache parameters and organization to be varied during runtime and includes an integrated online cache tuning technique implemented as specialized hardware or a custom coprocessor to adjust the parameter and organization. Since hard-core configurable cache architectures are based on circuit-level techniques that allow for selectively activating portions of the cache elements (such as cache blocks, ways, or sets) to reduce leakage power, we first review circuit-level techniques leveraged by configurable caches in Section 3.1. Section 3.2 reviews the configurable cache architectures in conjunction with the architectures’ corresponding cache tuning techniques for single-core architectures. Section 3.3 reviews some prevalent cache tuning techniques for multicore architectures, which share fundamental ideas with single-core cache tuning; however, special considerations must address multicore cache tuning challenges, such as cache interference between multiple threads/cores and complicated shared and private last-level cache design. In online dynamic cache tuning, the cache tuning process occurs either at the beginning of application execution for application-based cache tuning or at specified intervals during application execution for phase-based cache tuning. In phase-based cache tuning, not only how to tune the cache (how to explore the design space), but also when to tune the cache is important. Phase-change detection techniques are widely studied and readily incorporated transparently into configurable cache architectures. Section 3.4 summarizes phase-change detection techniques with respect to cache tuning for both single- and multicore architectures. 3.1. Circuit-Level Techniques for Reducing Leakage Power To reduce leakage power consumption, major semiconductor companies employ high-k [Bohr et al. 2007] dielectric materials to replace the common SiO2 oxide material in the process technologies. Since the process-level solution can be orthogonally combined with architecture-level techniques and the process-level solution was scarcely considered in architecture-level cache tuning design, this survey only focuses on circuitlevel techniques that are commonly used in architectural-level cache tuning hardware, rather than the process-level solutions for semiconductors. Since the number of active transistors directly dictates a circuit’s leakage power, deactivating unused transistors can reduce the leakage power. For example, infrequently used or unused cache blocks/sets can be turned off or placed into a low leakage mode using techniques, such as a gated-Vdd cache (Section 3.1.1), a drowsy cache (Section 3.1.2), or threshold voltage (VT ) manipulation (Section 3.1.3). 3.1.1. Gated-Vd d Cache. Powell et al. [2000] developed the gated-Vdd cache, wherein memory cells (e.g., SRAM) could be disconnected from the power and/or ground rails using a gated-Vdd transistor. Figure 3 depicts a memory cell with an NMOS gatedVdd . During a cache access, the address decode logic determines the wordlines to be activated, which causes the memory cell to read the values out to the precharged bitlines. As shown in Figure 3, the two inverters have a Vdd to Gnd leakage path. If the gated-Vdd transistor is turned on, the memory cell is in an active mode for normal operations. 
Otherwise, the memory cell is in a sleep mode and quickly loses the stored value. Recharging the power supply sets the memory cell to a random logic state. The ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. A Survey on Cache Tuning from a Power/Energy Perspective 32:19 Vdd _____ bitline wordline gated-Vdd control bitline Gnd Fig. 3. A memory cell with an NMOS gated-Vdd [Powell et al. 2000]. size of the gated-Vdd transistor should be designed carefully. The gated-Vdd transistor should be large enough so that the transistor does not impact the cell read/write speed while in the active mode. However, too large of a gated-Vdd transistor will diminish the sleep mode functionality and reduce the energy savings. Since the gated-Vdd transistor can be shared among multiple memory cells, the area overhead is very small. Although the sleep mode efficiently reduces the leakage power, a cache block in sleep mode does not preserve the stored data and an access to that data results in a cache miss and an associated wakeup time. Since the wakeup time and cache miss can impose significant performance overheads, cache tuning techniques that leverage gated-Vdd should conservatively place cache blocks into sleep mode. Gated-Vdd was implemented in several configurable cache architectures for dynamic tuning, such as the DRI (DeepSubmicron Instruction) cache [Powell et al. 2001] and cache decay [Kaxiras et al. 2001; Zhou et al. 2003] (Section 3.2). 3.1.2. Drowsy Cache. In a cache with gated-Vdd , the supply voltage for a memory cell is either fully on or fully gated (off). The drowsy cache [Flautner et al. 2002] provides a compromise between on and off by reducing the supply voltage as low as possible without data loss. The memory cells that are not actively accessed can be voltagescaled down to a drowsy mode to reduce the leakage power. However, the scaled voltage makes the circuit process variation-dependent and more susceptible to single event upset noise. These problems can be relieved by using an appropriate process technique in semiconductor production and choosing a conservative Vdd value. Rather than completely cutting off the circuit connection to power and/or ground rails in a gated-Vdd cache, the nonzero supply voltage in the drowsy mode can preserve the memory cell’s state. With respect to the data-preserving capability, the drowsy cache is referred to as a state-preserving cache and the gated-Vdd cache is referred to as a state-destroying cache. In a drowsy cache, accessing a cache block in active mode does impose any performance loss; however, accessing a cache block in drowsy mode requires first waking up the block, otherwise the data may be read out incorrectly. The transition from drowsy mode (low voltage) to full Vdd requires a one- or two-cycle wakeup time, which is much less than a cache miss. Therefore, cache blocks can be put into drowsy mode more aggressively than sleep mode. For the cache blocks with data that will be accessed again but not for a sufficiently long period of time, drowsy mode is superior to sleep mode because no cache miss is incurred. However, since a drowsy cache does not fully turn off the memory cells, a drowsy cache cannot reduce the leakage power by as much as a gated-Vdd cache. Figure 4 depicts the schematic of a memory cell in a drowsy cache. The circuit uses two PMOS gate switches to supply the normal supply voltage Vdd for the active mode and the low supply voltage VddLow for the drowsy mode, respectively. 
The memory cell is ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:20 W. Zang and A. Gorden-Ross LowVolt Vdd VddLow _______ LowVolt VT0>0.3V VT0>0.2V _____ bitline Gnd bitline wordline Fig. 4. Schematic of a memory cell in a drowsy cache [Flautner et al. 2002]. connected to a voltage-scaling controller, which determines the supply voltage between the active mode and the drowsy mode based on the state of the drowsy bit. Various dedicated tuning techniques have been developed to set/clear the drowsy bit based on the cache access pattern (Sections 3.2 and 3.4). 3.1.3. Threshold Voltage (VT ) Manipulation. Manipulating the threshold voltage VT of transistors can reduce leakage power. However, since the device speed depends on the difference between Vdd and VT , manipulating VT results in a trade-off between the speed and power. Decreasing VT increases speed while consuming high power, whereas increasing VT can significantly reduce the power consumption, but at the expense of slower speed. Dual-VT [Dropsho et al. 2002b] statically employed low-VT (high speed, high leakage) transistors on the critical paths and high-VT (low speed, low leakage) transistors on the noncritical paths at design time. The cache was implemented using high-VT transistors in the memory cells and low-VT transistors in all other areas within the SRAM. Dynamically raising the VT was accomplished by modulating the back-gate bias voltage for the multithreshold-CMOS (MTCMOS) [Hanson et al. 2003]. The memory cells with high VT were set to drowsy mode while preserving the cells’ value. Another technique used dynamic forward body-biasing to low-leakage SRAM caches [Kim et al. 2005] wherein only the active portion of the cache was modulated to low-VT for fast reads and writes. The MTCMOS technique is similar to the drowsy cache with respect to modulating the power supply voltage, and a cache block in drowsy mode must be woken up prior to accessing the cache block’s data. As compared to the drowsy cache, the MTCMOS cache provides a higher reduction in leakage power; however, the tradeoff is more complicated circuit control and a higher overhead due to the longer drowsy mode wakeup time. 3.1.4. Summary. Taking the CMOS fabrication cost and the compromise between leakage power savings and the device speed into consideration, the gated-Vdd and drowsy caches are the most suitable for state-destroying and state-preserving techniques, respectively, as compared to threshold voltage manipulation [Chatterjee et al. 2003]. Therefore, the gated-Vdd and drowsy caches are given more attention in architecturallevel configurable cache design than threshold voltage manipulation. Based on different features provided by the gated-Vdd cache and the drowsy cache, Meng et al. [2005] explored the limits of leakage power reduction for the two techniques using a parameterized model. Li et al. [2004] compared the effectiveness of the two techniques for different level-two cache latencies and different gate oxide thickness values and showed that the gated-Vdd cache could be superior to a drowsy cache when ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. A Survey on Cache Tuning from a Power/Energy Perspective Configurable cache architecture Parameter change controller System execution 32:21 System metric tracking System metric collection Best cache parameters or parameters adjustment Cache tuning decision Fig. 5. Cache tuning components. the level-two cache access time was sufficiently fast. 
Li et al. [2002] concluded that using a gated-Vdd cache for the level-one cache and a drowsy cache for the level-two cache yielded substantial leakage power savings without significantly impacting the performance. A hybrid of the drowsy and gated-Vdd techniques can be used in a single cache. The cache blocks with a short period of inactivity are put into drowsy mode. If those cache blocks continually stay inactive for a long period, the blocks are powered off to sleep mode. In the hybrid cache, the problems for the state-destroying cache, such as the large overhead in accessing the deactivated cache blocks and the extra energy consumed during the long waiting period of detecting the unused blocks, are alleviated. Zhang et al. [2002] used compiler analysis for loop instructions and data flow to guide the power-mode switch between active, drowsy, and sleep modes. Power-mode control instructions for changing the supply voltages were inserted into the code in order to activate/deactivate cache blocks during runtime. 3.2. Cache Tuning for Singe-Core Architectures Figure 5 depicts the three essential cache tuning components and the interactions between these components. The configurable cache architecture contains a parameter change controller that orchestrates the cache parameter changing process. Circuit-level techniques that selectively activate portions of the cache elements on different levels of granularity provide the basis for adjusting the total cache size, block size, and associativity. The parameter change controller leverages these circuit-level techniques to dynamically adjust the cache parameters. System metric tracking can directly monitor the system’s power/energy consumption, collect cache misses within a time interval, or record cache access patterns using specially-designed hardware counters or software profiling. The cache tuning decision evaluates the system metrics to determine the best (optimal may not be known) cache configuration using one of two techniques: (1) evaluating a series of configurations and selecting the best configuration; or (2) increasing or decreasing (adjusting) the cache parameters based on the system metric tracking results. The cache tuning decision can be implemented in either hardware or software, where the optimization target criteria (e.g., lowest energy consumption) are calculated from the collected system metrics for different cache configurations and are compared to determine a better cache configuration than the previous configuration. The cache tuning decision, best cache parameters, or parameter adjustment, is input into the configurable cache architecture, where the parameter change controller changes the cache parameters accordingly. In this section, we discuss cache tuning techniques, categorized based on the technique’s cache parameter adjustment capabilities, in conjunction with the configurable cache architecture leveraged by the cache tuning techniques. ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:22 W. Zang and A. Gorden-Ross 3.2.1. Way/Bank Management. Cache way/bank management provides the mechanism with means to adjust the total cache size and associativity using the following techniques: shutting down cache ways/banks (configurable size); concatenating cache ways (configurable associativity); configuring individual cache ways to store instructions only, data only, both instructions and data, or shutdown (way designation); and/or partitioning a large cache structure into multilevel caches. 
Motorola’s M∗ CORE [Malik et al. 2000] processor included a four-way configurable level-two cache with way designation. Albonesi et al. [1999] proposed a configurable cache that partitioned the data and tag arrays into one or more subarrays for each cache way. First, the application was run with all ways enabled and performance-sampling techniques were used to collect cache access information. Based on this information, the performance degradation and energy savings afforded by disabling ways were estimated. Balasubramonian et al. [2000] improved that configurable cache by partitioning the remaining ways as level-two cache. Therefore, the cache could be virtually configured as a single- or two-level cache hierarchy. For each cache access, the level-one partition was checked first. On a cache miss, the level-two partition was checked. During cache tuning, the level-one cache was initialized to the smallest size. After a certain interval, if the accumulated miss rate was larger than a predetermined threshold, the level-one cache was increased to the next larger size until the level-one cache size reached the maximum value or the miss rate was sufficiently small. Amongst the examined cache configurations, the configuration with the lowest CPI was chosen. Based on the way selective cache [Albonesi 1999], Dropsho et al. [2002a] partitioned the cache ways into primary ways and secondary ways, where cache accesses checked only the primary ways first, then checked the secondary ways on a primary way miss. In this architecture, the LRU information was recorded for performance tracking, from which the hit ratios for any partitioning of the cache were constructed to calculate the energy consumption of each possible cache configuration. Kim et al. [2002] developed control schemes for drowsy instruction caches in which only the bank that was predicted to be accessed was kept active, while the other banks were put into drowsy mode. Ishihara and Fallah [2005] proposed a non-uniform cache architecture where different cache sets had different numbers of ways. The authors’ proposed cache tuning algorithm greedily reduced the energy by iteratively searching the number of ways that offered the largest energy reduction for each cache set. Ranganathan et al. [2000] partitioned cache ways and used different partitions for different processor activities, such as hardware lookup tables, storage area for prefetched information, or other modules for hardware optimization and software controlled usage. 3.2.2. Block/Set Management. Block/set management provides a finer granularity management technique, compared to way/bank management. Instead of managing the cache blocks/sets globally, the cache blocks/sets can be individually managed. However, as a trade-off, this finer granularity requires more hardware overhead to monitor and manage each individual cache block/set. A common block/set management technique shuts down individual cache sets/blocks to vary the cache size and/or fetches or replaces a variable number of cache blocks simultaneously to adjust the block size. Powell et al. [2001] developed the DRI (Deep-Submicron Instruction) cache, which downsized/upsized cache sets by powers of two. All of the tag bits necessary for the smallest cache size were maintained, and cache lookup used an index mask to select the appropriate number of index bits for the current number of sets. 
The number of cache misses were accumulated with a counter after a fixed interval to determine the cache upsize/downsize based on the relationship between the measured miss rate and a preset value. Kaxiras et al. [2001] varied the cache size on a per-cache block basis by shutting down cache blocks that contained data that was not likely to be reused. A decay ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. A Survey on Cache Tuning from a Power/Energy Perspective 32:23 counter associated with each cache block was inspected at an adaptively varied decay interval to determine when a cache block should be shut down. Adaptively adjusting the decay interval was critical to maximizing the performance and energy savings. An aggressively small decay interval could cause extra cache misses due to shutting down cache blocks that were still “alive,” thereby destroying the cache resizing advantages. If the decay interval was too large, the unused cache blocks remained active for a longer period of time while waiting for the decay interval to expire, thus expending extra energy. Flautner et al. [2002] provided two schemes for managing the cache blocks. One scheme periodically put all cache blocks into drowsy mode and drowsy blocks were woken up when the block was accessed. This scheme required only a single global counter; however, the performance overhead could be high due to the extra cycles spent on the drowsy blocks’ wakeup time. The second scheme put the cache blocks into drowsy mode if the block had not been accessed during a decay interval. This scheme had more hardware overhead due to monitoring per-block accesses. Hu et al. [2003] improved the drowsy cache control proposed by Flautner et al. [2002] by exploiting the code behavior to identify subroutine instructions. These instructions were tagged as hotspots and were kept active and excluded from drowsy control. Zhou et al. [2003] only deactivated the data portion of cache blocks and kept the tag active. The decay interval was dynamically adjusted by monitoring the miss rate caused by the deactivated blocks acquired from the active tags. If the miss rate caused by the deactivated blocks was too small, the decay interval was decreased and the total percentage of deactivated blocks was increased. Otherwise, the decay interval was increased to deactivate cache blocks more conservatively. Ramaswamy and Yalamanchili [2007] varied the cache size by folding/merging cache sets with a long period of little or no activity and splitting sets if the number of accesses to the merged sets exceeded a preset threshold within a time interval. Since folding/merging cache sets caused the set indexes to differ from the original set indexes before folding/merging, a hardware-based lookup table maintained the new set indexes, which incurred one extra cycle. Veidenbaum et al. [1999] leveraged a base cache with a small physical block size and included the ability to dynamically fetch and replace a variable number of blocks simultaneously according to the exhibited spatial locality. Chen et al. [2004] also presented a configurable cache with variable cache block size. The block size was controlled by a spatial pattern predictor, which used access history information to predict future cache block usage and predicted the number of cache blocks to be fetched. 
In addition to fetching a multiple number of cache blocks simultaneously, the configurable cache enabled the fetching of subblocks that were predicted as to-be-referenced and deactivated subblocks that were predicted as unreferenced to save leakage energy. 3.2.3. Cache Size, Block Size, and Associativity Management. Since adjusting any of the cache parameters can impact energy consumption and increased cache configurability (more possible parameter values) results in increased energy savings, Zhang et al. [2004] designed a highly configurable cache with configurable size (using way shut down), associativity (using way concatenation), and block size (by fetching additional blocks through block concatenation). The authors designed a configuration circuit that used partial tag bits as inputs to implement way concatenation in logic. Since the configuration circuit was executed concurrently with the index decoding, the configuration circuit’s delay did not increase the cache lookup time or the critical path delay. Cache tuning was complicated since variations in the cache performance metric did not intuitively indicate which cache parameter should be downsized/upsized. Additionally, similarly to offline cache tuning (Section 2.2.1.6), exhaustive exploration was not feasible for online cache tuning due to lengthy cache tuning time. Runtime design space exploration typically executes the application for a period of time in each potential ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:24 W. Zang and A. Gorden-Ross Table II. Summary of Online Dynamic Cache Tuning Techniques Techniques Management techniques Way/bank management 1. Shut down some ways. 2. Concatenate ways. 3. Configure individual ways for different usage patterns. 4. Partition ways into level one and level two caches. 1. Shut down some blocks/sets. 2. Fetch/replace a variable number of blocks/sets simultaneously. Block/set management Size, block size, and associativity management 1. Way shutdown. 2. Way concatenation. 3. Block concatenation. Varied cache parameters Size Associativity Size Block size Size Block size Associativity Advantages 1. Small changes to a conventional cache. 2. Small hardware overhead. Drawbacks 1. Limited cache configurability 1. Provides finer 1. Finer granularity granularity management tuning. required. 2. More energy 2. More hardware savings due to overhead for finer finer granularity granularity tuning. tuning. 1. Increased 1. Large design space. configurability. 2. Large tuning time 2. Increased energy and tuning energy savings. even with heuristics. configuration until the best configuration is determined. However, during this exploration, many inferior configurations are executed and introduce enormous energy and performance overheads. Heuristic search techniques reduce the number of configurations explored, which significantly reduces the online cache tuning time and energy/performance overheads. Zhang et al. [2004] analyzed the impact of each cache parameter on the cache miss rate and energy consumption. According to the increasing order of the parameters’ impact on energy, Zhang developed a heuristic search for a single-level cache that determined the best cache size, block size, and then associativity. The impact-ordered heuristic search can prune tens to thousands of configurations from the design space. The number of pruned configurations depends on the number of configurable parameter values. 
Based on Zhang’s impact-ordered single-level cache tuning heuristic, GordonRoss et al. [2004] developed a two-level cache tuner called TCaT. TCaT leveraged the impact-ordered heuristic to interlace the exploration of separate instruction and data level-one and level-two caches. TCaT explored only 6.5% of the design space and determined cache configurations that consumed only 1% more energy than the optimal cache configuration. Gordon-Ross et al. [2009] extended TCaT for a unified second-level cache, which achieved an average energy savings of 61% while exploring only 0.2% of the design space on average. 3.2.4. Summary. Configurable cache architectures leverage circuit-level techniques to vary cache parameters. Based on the cache parameter adjustment capability, online cache tuning techniques can be classified as way/bank management, block/set management, and cache size, block size, and associativity management. Table II summarizes these classifications: the first column lists the techniques, the second column lists the techniques to manage the configurable cache architectures, the third column indicates the corresponding varied cache parameters, and the fourth and fifth columns indicate the corresponding advantages and drawbacks, respectively. Although the individual block/set management’s finer granularity potentially provides more energy savings as compared to the global block/set or way/bank managements’ coarser granularity, coarse granularity has several advantages over fine granularity. Coarse granularity management requires only small changes to a conventional ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. A Survey on Cache Tuning from a Power/Energy Perspective 32:25 set-associative cache and introduces small hardware overhead. In finer granularity management, such as per-cache block management, a cache block access monitor (e.g., counter) and additional controller are required for each cache block. Comparatively, in coarse granularity management, only one global controller and monitor is required for all ways or for the entire cache. In addition, an address’s index scales well with changes in cache size, associativity, and block size in coarse granularity management. However, coarse granularity management limits the configurable cache design space. For instance, way/bank management is limited by the cache’s base associativity. Caches with a high base associativity provide more configurability than caches with a low base associativity, but high base associativity caches have longer cache access time and more tag storage space as a trade-off. For dynamic data cache tuning, modified cache blocks may complicate cache resizing. When a dirty cache block in a write-back cache is deactivated, the data consistency/coherency must be maintained in case the block is reactivated and the modified cache blocks are accessed again. The most straightforward technique is to flush all of the cache blocks before the blocks are disabled. However, this cache flush imposes large performance and energy overheads, especially for a multilevel cache hierarchy or a multicore cache, and diminishes the energy savings. To minimize cache flushing, Zhang et al. [2004] increased the cache size and associativity values instead of decreasing the values. For the state-preserving cache, Albonesi et al. [1999] provided an alternative technique instead of cache flushing. The cache coherence behaved as if the entire cache was enabled. 
In addition, the authors designed hardware that allowed the cache controller to read blocks from deactivated ways into activated ways using a fill data path and then invalidated the blocks in the deactivated ways. For the state-destroying cache, deactivating a dirty block requires a write-back. If multiple blocks are written back simultaneously as a result of deactivating a large portion of the cache, bus traffic congestion may occur due to the limited main memory bandwidth. Redistribute write-back techniques [Lee et al. 2000] that write dirty cache blocks to main memory prior to the block’s eviction/deactivation can be used to avoid this congestion. 3.3. Cache Tuning for Multicore Architectures Some single-core cache tuning techniques can be leveraged in multicore cache tuning, such as unused way/set shutdown, way concatenation, and merging/splitting cache sets. However, when adjusting the cache parameters, the interdependency between cores must be considered, such as shared data consistency and resource contention. Additionally, various multicore cache organizations bring new configuration opportunities and challenges, which increases the cache tuning complexity. Section 3.3.1 discusses efficient design space exploration techniques for heterogeneous cache architectures, where each core’s private cache can be tuned with a different size, block size, and associativity. These exploration techniques are limited to the first level of cache and techniques for multilevel caches and shared cache parameter and organization configurations are still open research topics. Most research has focused on the design and optimization of a CMP’s last-level cache (LLC) due to the large effects that the LLC has on system performance and energy consumption. The LLC is usually the second- or third-level cache, which is typically very large and requires long access latencies that may be non-uniform (e.g., non-uniform access architecture (NUCA) caches). In static-NUCA (S-NUCA), data blocks mapping to a cache bank are statistically determined by the block addresses. In dynamic-NUCA (D-NUCA), data blocks can migrate between banks based on the proximity of the requesting cores, and thus reduce the hit latency. As a trade-off, locating a desired block in D-NUCA may require searching multiple banks. Since the D-NUCA cache’s long data ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:26 W. Zang and A. Gorden-Ross migration time and searching operations increase the cache’s dynamic energy and degrade the system performance, cache tuning techniques for S-NUCA and D-NUCA have different challenges. Additionally, since large LLCs occupy a large die area and consume considerably high leakage power, deactivating unused portions of the LLC is important for energy savings. Shared LLCs have more efficient cache capacity utilization than private LLCs, because shared LLCs have no shared address replication and no coherence overhead for shared data modification. However, large shared LLCs may have lengthy hit latencies, and multiple cores may have high contention for the shared LLC’s resources even though the threads do not have shared data. Such contention generates high interference between threads and considerably affects the performance of individual threads and the entire system. To overcome the drawbacks of both shared and private LLCs, the cache capacity can be dynamically managed by assigning a variable amount of private and shared LLC capacity for each core. 
This dynamic sharing/partitioning potentially presents an important multicore cache tuning opportunity. Section 3.3.2 reviews techniques that dynamically partition the shared cache capacity between cores or partition private caches to allow cores to share the cache capacity. Section 3.3.3 summarizes this section’s contributions. 3.3.1. Efficient Design Space Exploration. Efficient single-core cache design space exploration (Section 3.2.3) provides a fundamental basis for multicore cache design space exploration, but multicore design space exploration introduces additional, complex challenges. In heterogeneous multicore systems, the design space grows exponentially with the number of cores. Cores without data sharing can leverage single-core cache-tuning heuristics individually; however, cache tuning should not simply commence on each core simultaneously. Since cache tuning incurs energy, power, and performance overheads while executing inferior, non-optimal configurations, performing cache tuning on each core simultaneously may introduce a large accumulated overhead. Additionally, cores executing data-sharing applications cannot be tuned individually without coordinating the tuning and considering the cache interactions, which have circular tuning dependencies, wherein tuning one core’s cache affects the behavior of the other cores’ caches. For example, increasing the cache size for one core increases the amount of data that the cache can store and decreases the miss rate. However, this larger cache may store more shared data, which increases the number of cache coherence evictions and forced write backs for all other cores. Even though several previous works optimize an entire heterogeneous multicore architecture using design space exploration, we limit our survey to cache tuning techniques only. Based on Zhang et al.’s single-core configurable cache architecture and cache tuning techniques [2004], Rawlins and Gordon-Ross [2011] developed an efficient design space exploration technique for level-one data cache tuning in heterogeneous dual-core architectures, where each data cache could have a different cache size, block size, and associativity. Due to the cache interactions introduced by shared data, independently tuning the three parameters for each cache (similarly to previous single-core heuristics) was not sufficient. Additional adjustments were required based on certain conditions surmised from empirical observations. The heuristic determined cache configurations within 1% of the optimal configuration, while searching only 1% of the design space. The authors extended this heuristic for data cache tuning in heterogeneous architectures with any number of cores [2012]. In that work, Rawlins and Gordon-Ross classified the applications based on data sharing and cache behavior and used this classification to guide cache tuning, which reduced the number of cores that needed to be tuned. The heuristic searched at most 1% of the design space and determined configurations within 2% of the optimal configuration. ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. A Survey on Cache Tuning from a Power/Energy Perspective 32:27 3.3.2. LLC Partitioning. Shared and private LLCs both have advantages and drawbacks with respect to the cache capacity utilization efficiency and thread interferences, thus combining the advantages of both into a hybrid architecture is required. In a private LLC, the cache capacity is allocated to each core statically. 
If the application running on a core is small and the allocated private cache capacity is large, the excess cache capacity is wasted. Alternatively, if the application running on a core is large and the allocated cache capacity is small, performance may be compromised due to a large cache miss rate. In a shared LLC, the cache capacity occupied by each core is flexible and can change based on the applications’ demands. However, if an application sequentially processes a large amount of data, such as multimedia streaming reads/writes, the application will occupy a large portion of the cache with data that is only accessed a single time; thus a large cache capacity may not reduce the cache miss rate for this application. As a result, the coscheduled applications running on the other cores experience high cache miss rates due to shared cache contention. However, in a shared LLC, the capacity can be dynamically partitioned, and each partition is allocated to one core as the core’s private cache, and this private cache’s size and associativity can be configured. Alternatively, private LLCs can be partitioned to be partially or entirely shared with other cores. Cache partitioning consists of three components, the cache partitioning controller, system metric tracking, and the cache partitioning decision, with functionalities similar to the components in online cache tuning for single-core architectures (Section 3.2) and can be implemented in the OS software or in hardware. The cache partitioning controller can be implemented using a modified replacement policy in a shared LLC or a coherence policy in a private LLC. By placing replacement constraints on the candidate replacement blocks in a cache set and the insertion location for each core’s incoming blocks, an individual core’s cache occupancy can be controlled. The cache can also be directly partitioned by the OS using OS-controlled page placement. System metric tracking requires tracking metrics for the entire system or each core. One possible solution is to use dynamic set sampling [Qureshi and Patt 2006]. Dynamic set sampling samples a group of cache sets and maintains the sets’ tags as if the sampled sets were configured with one potential configuration in the design space and/or as if the sampled sets were entirely allocated to one core. The sampled sets are used to approximate the global performance of the entire cache. With multiple groups of sampled sets such that each group represents each possible configuration, the best configuration can be determined directly. Since the number of sampled sets increases exponentially with the number of cores, most previous works designed novel techniques to reduce the hardware overhead. Another system metric tracking technique that is similar to single-core cache tuning monitors the cache access profile in hardware or software and analyzes each core’s cache requirements and interactions with other cores. Previous works present two common cache partitioning decision optimization targets. One target is to optimize the overall performance, such as the summation/average of raw/weighted IPCs or cache miss rates. The other target is to maintain performance fairness across the cores. Hsu et al. [2006] compared and indicated that the cache partitions could vary greatly with the two different targets. However, the correlation between the two optimization targets and energy consumption has not been investigated, although compromising performance and energy consumption is an interesting research topic. 
Additionally, combining cache partitioning and cache configuration (such as deactivating partial cache ways/sets) can enhance the performance and energy optimization. Since determining the optimal solution to resource partitioning is an NP-hard problem [Rajkumar et al. 1997], greedy and heuristic techniques [Suh et al. 2004; Qureshi and Patt 2006] are used to make a quick cache partitioning decision. ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:28 W. Zang and A. Gorden-Ross In NUCA architectures where non-uniform latencies exist in accessing different ways, proximity-aware cache partitioning [Liu et al. 2004; Yeh and Reinman 2005; Huh et al. 2007; Dybdahl and Stenstrom 2007] is preferred to reduce hit latency. In a distributed shared LLC, either an entire way or set is located in one cache bank, allowing the cache to be partitioned either on a way or set basis. In this survey, we do not review shared LLC bandwidth partitioning, although limited shared LLC bandwidth can be a performance and energy bottleneck. 3.3.2.1. Way-Partitioning. In way-partitioning, the LLC is partitioned using block replacement management, and the way partitions are assigned to the cores either physically or logically to ensure each core occupies no more than the cores’ quota. Physical way-partitioning is on a way-granularity and uses ways as partition boundaries. The physical way-partitioning controller enforces cache blocks from each core to reside in fixed physical ways. Even if an application running on one core does not utilize the application’s assigned quota temporally, the other cores cannot leverage this vacant cache capacity. Logical way-partitioning is on a block-granularity and logically allocates blocks to each core. Partitioning boundaries are not strictly enforced and a core can utilize another cores’ way quota if the other core has vacant cache capacity. The logical way-partitioning controller is implemented using a modified replacement policy. Since way-granularity is coarser than block-granularity, way-granularity offers a smaller design space than block-granularity. Since physical way-partitioning restricts cores from replacing another core’s blocks, single-accessed blocks may pollute the cache for a long period of time. Logical waypartitioning alleviates this problem using a specially-designed replacement policy [Jaleel et al. 2008b; Xie and Loh 2009], which achieves better cache utilization as compared to physical way-partitioning. However, the cache partitioning control and block lookup in physical way-partitioning is usually less complex than in logical waypartitioning. For example, a common physical way-partitioning implementation is column caching [Chiou et al. 2000], where each core is associated with a bit vector in which the bits specify the replacement candidates. This bit vector does not introduce a cache lookup penalty since the partitioned cache access is precisely the same as the cache access in a conventional cache. In logical way-partitioning, block replacement candidates may belong to other cores, thus determining the replacement candidate is not as straightforward as in way-granularity cache partitioning. Physical Way-Partitioning. Qureshi and Patt [2006] developed Utility-based Cache Partitioning (UCP). A monitor tracked the cache misses for all possible number of ways for each core using dynamic set sampling. The monitor maintained the tags of only a few sampled sets, as if these sets were entirely allocated to one core. 
By associating a hit counter with each way in the sampled sets, the marginal utility (reduction in cache misses by having one more way) of each way for every core was calculated, and the partitioning with the minimal total number of cache misses was selected. Greedy and refined heuristics were used to determine the cache ways’ assignments to the cores. Iyer [2004] classified the applications into different priorities according to the application’s degree of locality and latency sensitivities. A single LLC was partitioned on a set/way basis and could be divided into multiple small heterogeneous caches, whose organization and policies were different. The OS allocated cache capacity to each application based on the monitored performance and the application’s priority. Varadarajan et al. [2006] divided the cache into small direct-mapped cache units, which were assigned to each core for exclusive usage. In each cache partition, the cache size, block size, and associativity were variable. Kim et al. [2004] developed cache partitioning for fairness optimization using static and dynamic techniques. Static cache partitioning utilized the stack distance profile of cache accesses to determine the cores’ ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. A Survey on Cache Tuning from a Power/Energy Perspective 32:29 requirements. Dynamic cache partitioning was employed for each time interval. By monitoring cache misses, a core’s cache partition was increased or decreased as long as a better fairness could be obtained. If there was no performance benefit after one time interval with the newly partitioned results, a rollback was invoked to revert to the previous partitioning decision. Each of these cache partitioning techniques were proposed to partition a shared LLC. Lee et al. [2011] developed the CloudCache, which was an LLC physically composed of private caches, but the virtual private cache capacity for each core could be increased/decreased using the physical private cache banks and the neighboring cores’ cache banks. Similarly to Qureshi and Patt [2006], the system metric tracking used dynamic set sampling to track the cache misses with a different number of ways for each core. A proximity-aware placement strategy was used such that the closest local bank stored the MRU block, and the farthest remote banks stored the LRU block. In contrast to partitioning a shared LLC to take advantage of a private cache’s features, MorphCache [Srikantaiah et al. 2011] merged private caches to leverage shared cache advantages. MorphCache dynamically merged private level-two and private level-three caches and split the merged caches to form a configurable cache organization. The private caches could remain private or be merged as shared caches amongst 2, 4, 8, or 16 cores. If a core overutilized its private cache and another core underutilized its private cache, the two caches were merged to obtain a moderate utilization. In addition, if two cores had a large amount of shared addresses, merging the two private caches as one shared cache could reduce the data replication and coherence overhead. Huh et al. [2007] partitioned the cache evenly into several partitions, where each partition was shared by any number of cores. Coherence was maintained using a directory located at the center of the chip, and the allocation was data proximity-aware. Liu et al. [2004] and Dybdahl and Stenstrom [2007] also investigated proximity-aware cache partitioning. Logical Way-Partitioning. 
In logical way-partitioning, the number of blocks used by individual cores or shared by multiple cores in different cache sets may be different and change in accordance with real-time accesses from different cores. Logical waypartitioning has a fine block-granularity. Suh et al.’s [2004] cache partitioning technique used marginal utility estimation for each core. Similarly to Qureshi and Patt [2006], dynamic set sampling tracked each core’s LRU information to calculate the reduction in cache miss rate by adding one more partition unit (cache way or block). A greedy algorithm determined the best cache partition by identifying the core that received the greatest cache miss rate improvement by adding one more partition unit and the core that suffered the least cache miss rate degradation by removing one partition unit. If the improvement outweighed the degradation, the core that received the most benefit was allocated one more partition unit, and the core that suffered the least degradation gave up one partition unit. Instead of simply partitioning cache ways similarly to column caching [Chiou et al. 2000], Suh’s cache partitioning controller modified the LRU replacement policy. If the core’s actual cache occupancy was larger than the core’s block quota, the core’s LRU block was replaced. Otherwise, the core could replace another core’s LRU block if the other core over-occupied its quota. Managing cache insertion enables implicit, pseudo-partitioning of a shared cache. Jaleel et al. [2008b] logically partitioned a shared LLC by selecting the insertion position for incoming blocks from each core such that the lifetime of the blocks existing in the cache was controlled by the insertion policy. This technique was based on the dynamic insertion policy (DIP) developed by Qureshi et al. [2007]. In a conventional LRU replacement policy, the new block is always inserted into the MRU position, which is referred to as an MRU insertion policy (MIP). DIP uses both MIP and BIP (bimodal ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:30 W. Zang and A. Gorden-Ross insertion policy). In BIP, incoming blocks are inserted into the MRU position with a small probability, and the majority of the blocks are inserted into the LRU position and have a short cache lifetime. The number of cache misses for each policy was monitored using dynamic set sampling. Although DIP allowed each core to select between two insertion policies, the number of sampled sets increased exponentially with the number of cores. The authors proposed an approach to independently determine the best insertion policy for each core and then improved the approach by considering the interference of the insertion policy decision among the cores. Results showed that DIP had a 1.3X performance improvement as compared to UCP [Qureshi and Patt 2006]. Xie and Loh [2009] modified both the insertion policy and the rule of updating the MRU stack in the cache sets. The incoming block could be inserted at an arbitrary position depending on the core’s allocated cache capacity. For instance, the block was inserted to a position close to the top of the stack (in a conventional LRU policy, the MRU position is on the top of the stack) if the core was allocated a high cache capacity. On a cache hit, the accessed block was moved up one position in the stack with a certain probability instead of directly moved to the top of the stack. 
In private LLCs, cooperative caching [Chang and Sohi 2006] is a technique that implicitly partitions the cache and attempts to store data in the on-chip cache as much as possible. Although the caches were private, the equivalent capacity was comparable to a shared cache. Rather than directly evicting blocks from a core's private cache, other cores' private caches could store the evicted blocks if those caches had spare capacity. In addition, in order to reduce data replication across the private LLCs, blocks that had replicas in remote caches were selected for eviction before the LRU block. According to the eviction control, the aggregate private LLCs behaved similarly to a shared cache with respect to the efficiency of capacity utilization. By controlling the probability of retaining evicted blocks in remote caches and the probability of evicting block replicas, the sharing portion could be adapted to the application's requirements. Beckmann et al. [2006] developed a hardware-based technique to dynamically determine the probability of evicting block replicas by keeping track of the number of hits to the LRU blocks, replicas, and recently evicted blocks to estimate the benefit of increasing/decreasing the replica percentage. Qureshi [2009] proposed dynamic spill-receive caching for private LLCs, where caches were designated as spiller caches, whose evicted blocks could be stored into another cache, or as receiver caches, which received the evicted blocks. To determine if a cache should spill or receive, dynamic set sampling tracked the cache misses such that in each private cache, one group of sampled sets acted as always-spill and another group of sampled sets acted as always-receive. The decision to spill or receive was made based on which group of sampled sets yielded fewer misses. Even though some of the logical way-partitioning techniques may not be strictly classified as way-partitioning, especially the private LLCs' cooperative caching, we classify these techniques as way-partitioning due to the use of a modified replacement policy. 3.3.2.2. Set-Partitioning. LLC set-partitioning can be implemented using either dedicated hardware or OS-based page placement. OS-based set-partitioning requires little or no hardware support and is thus easily realizable in modern architectures. The OS statically places pages in cache banks or dynamically changes the page placement by migrating pages across banks with some extra hardware cost. OS-based set-partitioning operates at the OS page granularity, and the page placement in a NUCA cache can be proximity-aware by placing pages in the cache bank closest to the requesting core. Hardware-Based Set-Partitioning. Srikantaiah et al. [2008] allocated the LLC's sets across the cores to reduce the cache misses caused by interference. By classifying the cache misses as intracore misses (the missed block was evicted from the cache due to fetching another block from the same core) and intercore misses (the missed block was evicted from the cache due to fetching another block from a different core), the authors observed that most of the intercore misses were introduced by a few hot blocks.
Therefore, the authors designed set pinning, which prevented the large number of accesses to the few hot blocks from evicting other cache blocks by storing the hot blocks in small cache regions owned privately by the cores that first accessed them. In addition, a set in the shared cache was assigned to the core that first accessed the set, and only the core with ownership of the set could replace blocks in the set. Since this first-touch allocation policy could result in unfairness, a fairer adaptive policy was adopted such that cores could relinquish the ownership of sets if non-owner cores yielded a large number of misses in the set. OS-Based Set-Partitioning - Page Coloring. In conventional cache architectures, cache blocks are uniformly distributed across cache banks based on the blocks' physical addresses in order to improve the cache bandwidth and cache capacity utilization; however, this block-granularity mapping may not be advantageous for multicore architectures. In multicore architectures, this mapping can significantly increase cache access latency and network traffic, since accessing contiguous memory blocks traverses most of the cache banks, and this mapping has no data proximity consideration for requesting cores. If the mapping is on the page granularity, consecutive blocks within a page can reside in a single bank. OS-based page coloring [Kessler and Hill 1992] enables full control of page placement in the cache banks using virtual-to-physical mapping. The placement of a virtual page is dictated by the page's assigned physical page address. In an OS-based set-partitioning cache, sets are distributed across banks, and an entire set resides in one bank. Therefore, in cache addressing, the few most significant bits in the set index dictate the bank, and the remaining bits dictate the set location in the bank. In page coloring, each bank is associated with one color, and a free list keeps track of the available pages in each bank. When a new page is required, the OS determines the optimal bank for the page and allocates a free page from the list by creating a virtual-to-physical mapping (a simple sketch of this color computation and allocation appears below). Cho and Jin [2006] proposed OS-based set-partitioning for shared LLCs. In multithreaded applications, where the accessed data in each core is mostly private, when a new page was requested, the page was directly allocated to the requesting core's closest bank and served as a private cache, thus providing proximity-aware caching. For multithreaded applications, where a page was shared by multiple cores, the shared page was allocated as a single instance in one bank, and the OS determined the optimal proximity bank based on the average distance to the page's requesting cores. If the local bank was too small to store the owner core's application, subsequent page allocations for that core could borrow cache capacity from neighboring cache banks. Similarly, if there were many actively accessed pages in one bank, which meant the bank was suffering heavy cache pressure, spreading pages to neighboring cache banks was necessary. In first-touch page allocation, the page placement does not change dynamically during a phase. When a phase change is detected (Section 3.4), especially when the application running on one core migrates to another core or the page sharing changes dramatically, page remapping is required. However, migrating pages requires copying the pages to main memory, which introduces high performance and energy costs.
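As a concrete illustration of the color computation and free-list allocation described above, the sketch below derives a page's color from the most significant set-index bits and hands out physical pages by color; the cache geometry, page size, and the fallback to neighboring colors are illustrative assumptions, not any particular operating system's implementation.

```python
PAGE_SIZE  = 4096      # bytes (assumed)
BLOCK_SIZE = 64        # cache block size in bytes (assumed)
NUM_SETS   = 4096      # sets in the shared LLC (assumed)
NUM_COLORS = 16        # one color per bank / set region (assumed)

def page_color(phys_addr):
    """Color = the most significant set-index bits, which select the bank; for a
    physically indexed cache these bits are also physical page-number bits, so
    the OS controls them through the virtual-to-physical mapping."""
    set_index = (phys_addr // BLOCK_SIZE) % NUM_SETS
    return set_index // (NUM_SETS // NUM_COLORS)

class ColoredPageAllocator:
    """Per-color free lists: the OS requests a physical page whose color maps the
    new virtual page into the desired cache bank."""
    def __init__(self, phys_page_numbers):
        self.free = {c: [] for c in range(NUM_COLORS)}
        for ppn in phys_page_numbers:
            self.free[page_color(ppn * PAGE_SIZE)].append(ppn)

    def alloc(self, wanted_color):
        # Fall back to neighboring colors when the preferred bank's list is empty,
        # mirroring the capacity borrowing described for Cho and Jin [2006].
        for offset in range(NUM_COLORS):
            color = (wanted_color + offset) % NUM_COLORS
            if self.free[color]:
                return self.free[color].pop()
        raise MemoryError("no free physical pages")

# Example: place a new page in the bank closest to the requesting core (bank 3 assumed).
allocator = ColoredPageAllocator(range(1024))   # 1024 free physical pages (assumed)
ppn = allocator.alloc(wanted_color=3)
```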
Awasthi et al. [2009] developed a technique to eliminate high page migration costs by preserving the page's original physical memory location. Since accesses to a remapped page should use new addresses, the TLB did not directly map the virtual address to the original physical address and instead produced the new address using a translation table. When page migration occurred due to remapping, the page in the cache was first flushed and then reloaded into the cache with the page's new addresses. To reduce the large space required by the translation table, Lin et al. [2009] used a coarser granularity that remapped an entire cache region consisting of multiple physical pages. Hardavellas et al. [2009] used the OS to classify the pages into three categories: shared instruction pages that were replicated in each core's local banks without any coherence mechanism, private data pages that were directly placed in the local cache banks of requesting cores, and shared data pages that were kept as one instance in a single bank to avoid coherence overhead, with multiple such pages distributed equally across all banks. 3.3.3. Summary. Data address sharing and shared resource contention in multicore architectures present numerous cache tuning challenges. In this section, we reviewed efficient design space exploration techniques for heterogeneous first-level private caches with varying cache size, block size, and associativity. Since shared data interference and coherence in multicore architectures complicate cache tuning, cache tuning heuristics for these architectures must also consider tuning the cache organization, designating shared and private partitions in the last-level cache, and proximity-aware partitioning and allocation of cache partitions to each core in NUCA caches to maximize performance and minimize energy consumption. We reviewed tuning techniques at the cache-partitioning level, including both explicit and implicit partitioning techniques, such as modified replacement and insertion policies for shared caches and cooperative caching for private caches. Unlike conventional cache partitioning techniques, where each partitioned portion is exclusively used by one core, the reviewed cache partitioning techniques allowed partitions to be optionally shared by any number of cores. However, although cache optimization for multicore architectures has been investigated, cache tuning that combines last-level cache partitioning and first-level cache configuration is still an open research topic. Additionally, cache tuning for SMT processors while considering other resource contentions has not been researched thoroughly. Table III summarizes the cache partitioning techniques reviewed in Section 3.3.2. The first column lists the target cache architectures to be partitioned, and the second, third, and fourth columns give the partitioning techniques, the benefits, and the partition granularity, respectively, for each architecture. Although we reviewed the techniques with respect to way- and set-partitioning, we summarize the techniques from a different perspective. Since the main objective of cache partitioning is to strengthen the private and shared cache organizations to achieve the collective benefits of both types of cache organizations, we summarize the techniques with respect to shared and private LLCs, with OS-based page coloring summarized separately.
Table III. Summary of Cache Partitioning Techniques

Shared LLC
- Allocate physical ways/sets to each core. Benefits: (1) isolated capacity for each core avoids shared cache contention; (2) adjustable "private" cache size/associativity. Partition granularity: way/set.
- Modified replacement and insertion policies. Benefits: (1) better utilization of cache capacity by evicting unused blocks that belong to other cores; (2) occasional long lifetime of some blocks helps reduce cache misses (since in some techniques, each incoming block's insertion location is nondeterministic and dictated by a probability). Partition granularity: block.

Private LLC
- Physically shrink/expand private caches. Benefits: (1) adjusts private cache size/associativity by borrowing from other cores' cache capacities. Partition granularity: way.
- Physically merge several private caches into a shared cache. Benefits: (1) reduces shared address replication; (2) avoids coherence overhead. Partition granularity: way.
- Cooperative caching: retain evicted blocks in other cores' caches, and replace blocks that have replicas in other cores' caches instead of replacing the LRU block. Benefits: (1) reduces shared address replication; (2) better cache capacity utilization. Partition granularity: block.

OS-based page coloring
- Allocate private pages to the requesting core's closest banks; replicate read-only shared pages to the requesting core's closest cache banks; store shared data pages as one instance in a single bank that is close to all the requesting cores. Benefits: (1) data proximity-aware to reduce hit latency; (2) isolated capacity for each core avoids shared cache contention; (3) better cache capacity utilization; (4) reduces coherence overhead; (5) no hardware complexity overhead, but requires OS modification. Partition granularity: OS page.

3.4. Phase-Change Detection
An application phase is defined as the set of intervals within an application's execution that have similar cache behavior [Sherwood et al. 2003b]. In order to adapt the cache configuration to different application phases during runtime, online cache tuning must detect and react to phase changes to determine the new optimal cache configuration for the next phase. Accurately detecting the precise phase-change moment is both critical and difficult [Gordon-Ross and Vahid 2007]. If the phase-change detection mechanism is not sensitive enough and the next phase's optimal cache configuration provides significant energy savings, missing the phase change will waste energy because the system will execute in a suboptimal configuration. Alternatively, if the phase-change detection is too sensitive, the overhead due to too-frequent cache tuning may consume more energy than simply running in a fixed base cache configuration for the application's entire execution. Phase-change detection can be performed either online or offline. Online phase-change detection has been the focus of much online algorithm research [Dhodapkar and Smith 2002; Gordon-Ross and Vahid 2007; Huang et al. 2008; Huang et al. 2003; Sherwood et al. 2003a]. Online algorithms process input as the input arrives over time, and thus do not have knowledge of all of the inputs at the start of execution. Online phase-change detection predicts future phase changes and phase durations (the length of time between phase changes) based on current and past system behavior. Some configurable cache architectures have integrated hardware for online phase-change detection.
The phase-change detection hardware monitors system behavior and performance and initiates cache tuning based on changes in the observed behavior. Since cache tuning is triggered after a variation in system behavior and performance is observed, online phase-change detection is a reactive technique. During the reaction time (the period of time before the system metrics reflect the behavior change), the cache configuration remains the same as the cache configuration from the previous phase, which wastes energy. If an application's behavior has significant variability, reactive phase-change detection will suffer from continuously lagging behind the phase change. Thus, the optimal system is an oracle that immediately recognizes phase changes and then triggers the cache tuning. Alternatively, offline phase-change detection is a proactive technique. Designers use offline phase-change detection analysis tools to determine the phase changes (typically denoted by particular instructions) and then provide these predetermined phase changes to the cache-tuning hardware to invoke cache tuning. Designers can also leverage offline cache-tuning techniques to statically determine the optimal cache configuration for each predetermined phase and provide these predetermined configurations to the cache-tuning hardware, which eliminates runtime design space exploration. One possible implementation is to leverage a compiler or linker to insert special instructions into the application binary at each phase change to automatically update the cache configuration registers and reconfigure the cache. Application execution is analyzed for a phase change at either a fixed or variable interval. During each interval, the phase analysis technique collects system metrics and analyzes these metrics at the end of each interval to determine if a phase change has occurred. Previous works in fixed-interval analysis explored intervals ranging from 100,000 [Balasubramonian et al. 2000] to 10 million instructions [Sherwood et al. 2003a, 2003b] or from 10 milliseconds to 10 seconds [Duesterwald et al. 2003]. Finding the best interval length for an application with a particular input stimulus is challenging except in situations where the interval length is not critical, such as long and stable phases. Some phase-change detection techniques [Huang et al. 2003; Shen et al. 2004; Lau et al. 2006] do not use fixed intervals, but instead react directly to variations in system metrics or application structure. Previous work revealed that applications tended to revisit phases [Sherwood et al. 2003b], which resulted in a technique called phase classification. Phase classification classifies intervals based on the interval's system metrics, and same-class intervals have the same optimal cache configuration. Based on the execution order of past phases, future intervals can be predicted as belonging to a particular phase class using Markov [Sherwood et al. 2001] or table-driven predictors [Dropsho et al. 2002a; Duesterwald et al. 2003; Sherwood et al. 2003a]. Both phase classification and phase prediction enable predetermination of the optimal cache configuration for future intervals, thereby eliminating cache tuning for same-class phases.
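As a toy illustration of the table-driven/Markov prediction idea cited above (a sketch, not the design of any particular paper), the following keeps per-phase successor counts and predicts that the most frequent past successor of the current phase class will occur next; the phase-class IDs are assumed to come from a separate classifier.

```python
from collections import defaultdict, Counter

class MarkovPhasePredictor:
    """Order-1 Markov predictor over phase-class IDs."""
    def __init__(self):
        self.table = defaultdict(Counter)   # current phase -> Counter of observed successors
        self.prev = None

    def update(self, observed_phase):
        """Record the phase class observed for the interval that just finished."""
        if self.prev is not None:
            self.table[self.prev][observed_phase] += 1
        self.prev = observed_phase

    def predict(self):
        """Predict the phase class of the next interval."""
        successors = self.table.get(self.prev)
        if not successors:                  # no history yet: assume the phase persists
            return self.prev
        return successors.most_common(1)[0][0]

# Usage: feed the classified phase ID of each interval, then ask for the next one.
p = MarkovPhasePredictor()
for phase in ["A", "B", "A", "B", "A", "B", "A"]:
    p.update(phase)
print(p.predict())   # prints "B": B has always followed A in this history
```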
The typical system metrics used to detect phase changes include system performance (e.g., cache misses, cache access delays, IPC (instructions per cycle), and system power), application structures (e.g., number of branches, looping patterns, number of instructions within a loop, and working set analysis counters), and memory accesses. System performance-based phase-change detection leverages a system performance monitor to track any performance degradation incurred by using the current cache configuration to determine whether a new cache configuration is required. This detection technique is generally used for online phase-change detection using custom-designed hardware for runtime system performance monitoring. Additionally, this detection technique can also be performed offline using an offline profiler for system performance analysis. Application structure-based phase-change detection marks a subset of loops and functions as basic elements and divides the elements into phases according to statistical characteristics. This detection technique is more complicated than system performance-based phase-change detection due to the typically large number of complex loops and functions. However, unlike system performance-based phase-change detection, which depends on detecting system metric changes, application structure-based phase-change detection depends on the variation in the application code, which potentially produces input set-, system architecture-, and optimization objective-independent phases [Lau et al. 2006]. Application structure-based phase-change detection can be used both online and offline. Memory access-based phase-change detection leverages the reuse distance of memory accesses to detect phase changes. Since the cache behavior directly depends on the memory access pattern, memory access-based phase-change detection is particularly effective for cache tuning, but can only be used for offline phase-change detection due to the complexity of memory access analysis. In the following sections, we present previous works in online and offline phase-change detection for single-threaded applications and multithreaded applications.

Fig. 6. Overview of phase-change detection (current metric accumulator, history table, comparator with a variation threshold, and reconfiguration control).

3.4.1. Online Phase-Change Detection. Online phase-change detection is autonomous and transparent to designers and is designed to dynamically detect phase changes and trigger cache tuning. Figure 6 depicts an overview of one possible hardware solution for phase-change detection. During system execution, a system metric, which records system performance or application structure characteristics, is accumulated during a fixed or variable interval. The accumulator consists of one or more system metric counters (shown as buckets divided by dashed lines in the figure). The number of counters depends on the dimensionality used to represent the system metric. A history table maintains past system metric history from all previous intervals [Dhodapkar and Smith 2002; Sherwood et al. 2003a] or from only the last interval [Balasubramonian et al. 2000]. The system metric accumulated for each interval or phase experienced previously is recorded in one column of the history table.
The phase-change detection unit compares the current system metric with the system metric's history to predict phase changes [Balasubramonian et al. 2000] or perform phase classification [Dhodapkar and Smith 2002; Sherwood et al. 2003a, 2001] for future intervals. A system metric variation threshold is set to tolerate intra-phase system metric fluctuations to avoid triggering cache tuning too frequently (i.e., no system metric remains precisely constant within a phase, and thus intra-phase fluctuations should not trigger cache tuning). When a phase change is detected, the cache-tuning hardware is triggered and tunes the cache based on the collected system metric or phase classification results. Some simpler phase-change detectors [Dropsho et al. 2002a; Powell et al. 2001] do not use a system metric history table and instead compare the current system metric with a predetermined bound, and cache tuning is triggered when the system metric exceeds the bound. Several cache architectures for online cache tuning discussed in Sections 3.2 and 3.3 have integrated online phase-change detection, which we will elaborate on in the remainder of this section. Additionally, we will review recently proposed system performance- and application structure-based phase-change detection techniques. System Performance-Based Phase-Change Detection. In Powell's DRI cache [Powell et al. 2001], the cache miss rate was compared to a cache miss rate bound at a fixed interval to detect phase changes and resize the cache. If the cache miss rate exceeded the cache miss rate bound, additional cache partitions were activated, and if the cache miss rate fell below the cache miss rate bound, cache partitions were deactivated. Dropsho et al. [2002a] calculated and maintained delay counters that recorded the delays in accessing the partitioned primary and secondary ways of the cache, which indicated phase changes. The delay tolerance threshold was fixed and predetermined. Balasubramonian et al. [2000] detected phase changes using hardware counters to collect cache miss rates, IPC, and branch frequencies for each fixed interval. Successive intervals were compared by calculating differences in the number of cache misses and branches. These differences were compared with a threshold to detect a phase change. If the current interval's behavior was sufficiently different from the last interval's behavior, cache tuning was triggered. Even though this phase-change detection technique could dynamically adjust the threshold, selecting an initial threshold value that guaranteed convergence was difficult. Application Structure-Based Phase-Change Detection. Sherwood et al. [2003a] used basic block vectors (BBVs) to detect and classify phases. An accumulator captured branch addresses and the number of instructions executed between branches. At the end of a fixed interval, the branch information was compared with past phases for phase classification, and the interval's phase classification was recorded in a past footprint table. After detecting the phase changes, the next phase's duration (phases can span multiple intervals) was predicted using a Markov predictor. In phase-based cache tuning, each phase's optimal cache configuration was stored in a table so that when the phase was reentered, the cache could be configured directly to the previously recorded optimal cache configuration.
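The classification step performed by the Figure 6 hardware and by BBV-style schemes can be sketched as follows; the Manhattan distance metric, the normalization, and the 10% variation threshold are illustrative assumptions rather than the exact parameters of any published design.

```python
def classify_interval(bbv, phase_table, threshold=0.1):
    """Match the current interval's basic block vector (or any metric vector)
    against the signature of each known phase class; a Manhattan distance below
    `threshold` means "same phase".  Creates a new phase class when nothing matches.

    phase_table: list of dicts with keys 'signature' and 'best_cache_cfg' (assumed layout).
    Returns (phase_id, stored optimal cache configuration or None)."""
    total = sum(bbv) or 1
    vec = [x / total for x in bbv]          # normalize away interval-length effects

    for pid, phase in enumerate(phase_table):
        dist = sum(abs(a - b) for a, b in zip(vec, phase["signature"]))
        if dist < threshold:
            # Re-entered a known phase: reuse its previously recorded configuration.
            return pid, phase["best_cache_cfg"]

    # No match: a phase change to a new class; cache tuning runs on the first visit.
    phase_table.append({"signature": vec, "best_cache_cfg": None})
    return len(phase_table) - 1, None
```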
Dhodapkar and Smith [2002] proposed hardware for detecting and classifying phases by tracking instruction working set signatures (i.e., a sample of the instruction addresses accessed during a fixed interval), and a phase change occurred when the working set changed. Dhodapkar and Smith [2003] also compared their technique with the BBV technique [Sherwood et al. 2001] and concluded that the instruction working set technique was more effective in detecting major phase changes (phases whose duration is larger than a preset threshold), while the BBV technique provided higher sensitivity and more stable phase classification. Since between long stable phases there is usually a short transition phase that exhibits unique behavior, Lau et al. [2005] designed phase prediction hardware that identified transition phases. Duesterwald et al. [2003] observed that periodic phase behavior was shared across metrics and proposed a cross-metric predictor that efficiently integrated multiple metric predictors to predict phase changes. Huang et al. [2003] proposed stack-based hardware to identify application subroutines by tracking procedure calls. A function-call stack tracked subroutine entries and returns to indicate the execution time spent on each subroutine. If the execution time spent in a subroutine was greater than a threshold value, the subroutine was identified as a major phase. 3.4.2. Offline Phase-Change Detection. Reactive techniques, such as online phase-change detection, are effective when applications exhibit stable behavior/phases across several successive intervals, because the overhead of cache tuning or lagging phase-change detection can be amortized over the long, stable phase. However, applications that exhibit significant variability (frequent phase changes and/or short phases) are not amenable to reactive techniques, because the phase-change detection process and cache tuning may consume the entire phase, leaving no time to execute in the optimal configuration. In addition, online phase-change detection may require additional hardware, resulting in increased execution time and energy consumption. Alternatively, offline phase-change detection is a proactive technique, because offline analysis can analyze the entire application and pinpoint the phase changes such that cache tuning can be triggered exactly when the phase change occurs. Since offline phase-change detection has oracle knowledge of the application, all future interval behavior is known a priori; runtime overheads, such as system metric collection and storage, are reduced; and the analysis does not interfere with application execution. However, offline phase-change detection requires significant design-time analysis and is typically based on trace/event profiling, which may be input-dependent, except for some application structure-based phase-change detection techniques [Lau et al. 2006]. In this section, we present previous works in offline phase-change detection categorized as system performance-, application structure-, and memory access-based. System Performance-Based Phase-Change Detection. System performance-based phase-change detection is a common online phase-change detection technique (Section 3.4.1) that can also be leveraged offline with little modification.
Offline performance profilers track performance changes with respect to system metrics, and an offline analysis tool analyzes these system metrics to detect phase changes similarly to the online techniques. Application Structure-Based Phase-Change Detection. Sherwood et al. [2001] evaluated applications' periodic behavior based on basic block distribution analysis. This analysis recorded the number of times each basic block was executed and then determined phase changes. In subsequent work, Sherwood et al. [2003b] used BBVs to detect and classify phase changes offline and chose a single representative interval (an interval that best represents the phase behavior) for each phase to use as an offline simulation point for the phase. Additionally, the authors developed an accurate and efficient tool, SimPoint [Hamerly et al. 2005], which has been widely used in offline phase analysis [e.g., Benitez et al. 2006; Gordon-Ross et al. 2008; Sherwood et al. 2003a]. Lau et al. [2004] investigated the trade-offs between using application structures with BBVs, loop branches, procedures, opcodes, register usage, and memory addresses for phase classification and concluded that register usage vectors and loop vectors were as efficient and effective as BBVs. By using the application's procedures and loop boundaries, Lau et al. [2006] selected software phase markers to signal phase changes and inserted these markers into the application with a static or dynamic compiler. The authors also integrated the software phase markers into SimPoint to create variable-length intervals and generated simulation points that mapped to the application code, which could be reused across different inputs and compilations for the same application code. Memory Access-Based Phase-Change Detection. Application structure-based phase-change detection can be inaccurate if different iterations of the same loop have significantly different behavior due to different input data/stimuli. Alternatively, phase-change detection based on a memory access trace can increase the phase-change detection accuracy for cache tuning, since the reuse distance of memory accesses directly dictates the cache miss rates according to the discussion in Section 2.2.2.2. Shen et al. [2004] used reuse distance patterns to detect phase changes using wavelet filtering. Ding and Zhong [2003] studied an application's reuse distance patterns for different inputs, and Shen et al. [2005] developed phase-based cache miss rate prediction for different application inputs by building a regression model. 3.4.3. Phase-Change Detection for Multithreaded/Multicore Architectures. In SMT and CMP architectures, overall-phase (a phase in a multithreaded/multicore application's overall execution) detection and classification are more complex than in single-threaded, single-core processors, since changing the cache parameters may change the coscheduling of applications, leading to different overall cache behavior. Furthermore, in real systems, it is unlikely that two applications will begin execution at the same time due to different scheduling times for each application. Thus, the coscheduled portions of concurrent applications will dynamically change with the OS scheduling during runtime. Previous works on overall-phase change detection and classification for SMT and CMP architectures are based on the classic BBV technique [Sherwood et al. 2003b] for single cores. Biesbrouck et al. [2004] designed a co-phase matrix for SMT architectures.
The authors stated that since the single-threaded phase behavior was still valid in multithreaded execution with shared resource contention, the BBV technique could be directly leveraged for phase analysis of each single-threaded application. If any thread contained a phase change, the overall-phase changed. The authors used a co-phase matrix to record all combinations of each single-threaded phase's representative interval when multiple threads executed simultaneously. The change in execution time for a single thread's phase due to co-execution with other threads was calculated based on changes in the IPC. Perelman et al. [2006] detected phases for parallel applications on CMPs by concatenating the interval traces for each thread and determining the overall-phase behavior across threads using a BBV technique. Using this technique, the intervals for different threads could be classified as the same overall-phase, and this overall-phase information was mapped back to the parallel execution. The stalls introduced by synchronization were considered when forming the fixed-length intervals such that an interval did not cross a synchronization point. Similarly to Biesbrouck et al. [2004], a phase change in any single thread indicated an overall-phase change. However, the number of overall-phases and corresponding representative intervals were reduced. To further combine the overall-phases, Namkung et al. [2006] classified the phase combinations recorded in the co-phase matrix using the Levenshtein distance to calculate similarity. Kihm and Connors [2005] provided a mathematical model to correctly weight and accumulate each of the thread phases for overall performance evaluation by considering the changes in overall-phases with respect to different start-time offsets for the single-thread phases. The majority of previous works on overall-phase analysis were based on application structure-based phase-change detection (i.e., BBVs [Sherwood et al. 2003b]), because although shared resource contention impacts the performance and timing-related behavior of single threads, the application itself is nearly unaffected, except for synchronization stalls and varied phase lengths. Therefore, the single-thread phases determined by application structure-based techniques are likely valid when co-executing with other threads. However, extending previous works to consider data interferences, sharing, and coherence between threads is nontrivial. 3.4.4. Summary. Application- and phase-based cache tuning must determine the optimal (or near-optimal) cache configuration quickly and accurately, since the energy overhead incurred while evaluating suboptimal configurations during cache tuning can be significant. A quick cache tuning time is particularly important for phase-based cache tuning, since a phase's execution may be short and cache tuning must complete quickly enough to amortize the energy overhead incurred during tuning. In phase-based cache tuning, phase-change detection can be transparently incorporated into configurable cache architectures to trigger cache tuning for different phases. We reviewed phase-change detection techniques from an online and offline perspective, wherein the phase changes are detected using three techniques: system performance-, application structure-, and memory access-based.
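Because memory access-based detection rests on the reuse (stack) distance of accesses, a minimal sketch of the underlying computation is shown below: it builds a reuse-distance histogram from an address trace and estimates misses for a fully associative LRU cache, the idealized model that the reuse-distance discussion in Section 2.2.2.2 refers to. The block size and the linear stack search are simplifications for clarity, not how production profilers are implemented.

```python
from collections import OrderedDict

def reuse_distance_histogram(trace, block_size=64):
    """LRU stack distance of each access: the number of distinct blocks touched
    since the last access to the same block; cold misses get distance = infinity."""
    stack = OrderedDict()            # block -> None, ordered from LRU (front) to MRU (back)
    hist = {}
    for addr in trace:
        blk = addr // block_size
        if blk in stack:
            dist = len(stack) - list(stack).index(blk) - 1   # blocks more recent than blk
            del stack[blk]
        else:
            dist = float("inf")
        stack[blk] = None            # (re)insert at the MRU position
        hist[dist] = hist.get(dist, 0) + 1
    return hist

def estimated_misses(hist, cache_blocks):
    """Fully associative LRU: an access misses iff its reuse distance >= capacity."""
    return sum(count for dist, count in hist.items() if dist >= cache_blocks)
```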
Table IV summarizes these three techniques: the first column lists the techniques, the second column gives the characteristics of each technique, and the third and fourth columns indicate whether each technique can be used online or offline, respectively.

Table IV. Summary of Phase-Change Detection Techniques

System performance-based. Characteristics: (1) monitors system performance to detect phase changes; (2) input-dependent for offline usage. Online usage: yes. Offline usage: yes.
Application structure-based. Characteristics: (1) analyzes application loops and functions to detect phase changes; (2) generally more complicated than system performance-based detection. Online usage: yes. Offline usage: yes.
Memory access-based. Characteristics: (1) uses the reuse distance of memory accesses to detect phase changes; (2) more effective for cache tuning than application structure-based detection; (3) input-dependent for offline usage. Online usage: no. Offline usage: yes.

For phase-based cache tuning, the phase-change detection technique must identify phases with an appropriate granularity based on application behavior. If the granularity is too fine, significant overhead can be incurred during redundant cache tuning, and if the granularity is too coarse, phase changes may not be detected, and overhead can be incurred while the system executes in suboptimal configurations. Robust phase-change detection techniques can vary the phase granularity during runtime by varying system metric thresholds, which dictate the similarity between adjacent phases. Even though most techniques use fixed thresholds, where the threshold is predetermined for an application using an offline profiler, dynamically adaptive thresholds have been developed and outperform fixed, predetermined thresholds. However, the initial value selection for dynamically changed thresholds remains challenging. Hind et al. [2003] provided an analysis on defining granularity and similarity for a phase and indicated that different granularity and similarity values may produce significantly different phase-change detection results. Overall-phase change detection and classification for SMT and CMP architectures are mainly based on the techniques for single-threaded, single-core architectures. Since multiple threads can co-execute different application portions at the same time, the combination of any phase from each single-threaded application may occur as one overall-phase. Since the number of phase combinations grows exponentially with the number of threads, previous works mainly focused on reducing the number of overall-phase classes using techniques to globally classify phases across threads and merge the phase combinations.

4. CACHE POWER/ENERGY ESTIMATION
The cache tuning techniques reviewed in the previous sections, including offline simulations, analytical modeling, and online cache monitoring, only evaluated cache performance based on statistics, such as the number of cache accesses, cache misses, write-backs, and additional latencies represented by the CPI, to determine the cache's energy consumption. The cache performance is determined directly from the system architecture and the input datasets. However, since the main goal of cache tuning is to optimize the cache configuration to minimize the overall system energy consumption, only evaluating the cache performance is insufficient.
There exists much research on circuit- and VLSI-level power and energy consumption analysis; however, our survey does not review these techniques and instead focuses on architectural-level power/energy quantification techniques and widely used estimation tools. Different cache configurations with varying cache sizes consume different amounts of cache leakage power (assuming that the fabrication technology and supply voltage are fixed). In addition, different cache configurations produce diverse dynamic energy consumption for each cache read/write. Cache misses introduce memory accesses and CPU stalls, whose power consumption should be considered in addition to the cache configuration effects. The cache leakage power, cache dynamic energy per read/write, cache latency per access, and the number of cache accesses and misses are used to model the cache-related energy consumption [Zhang et al. 2004; Dropsho et al. 2002a] for in-order cores. However, for more complicated contemporary cores, which leverage various techniques to hide cache/memory access latencies (e.g., out-of-order execution, instruction-level parallelism, memory-level parallelism, etc.), a comprehensive energy model is difficult to construct. These same challenges also arise for multithreaded/multicore architectures. Caches are constructed using regular SRAM arrays, and thus the power is easily estimated based on the cache size and organization. Cacti [Tarjan et al. 2006] is a widely used memory hierarchy simulation tool that provides not only the cache leakage power and dynamic energy per read/write, but also the area and access latency. Cacti is based on analytical modeling, in which the power/energy is calculated using parameterized equation models. The Wattch tool [Brooks et al. 2000], used with SimpleScalar [Austin et al. 2002], provides power consumption estimates for the entire processor; Wattch's cache modeling is based on Cacti. Wattch only models dynamic power consumption, which can be compensated for by integrating the HotLeakage package [Zhang et al. 2003]. HotSpot [Huang et al. 2006] provides an architecture-level temperature model based on compact thermal models and stacked-layer packaging, considering heat flow and electrical phenomena. SimWattch [Chen et al. 2007] integrates Simics [Magnusson et al. 2002] and Wattch in a full-system simulation environment. SimWattch can evaluate the power efficiency of microarchitectures, applications, compilers, and operating systems. SESC [Renau et al. 2005] models the power for multithreaded processors and the cache hierarchy based on Wattch and Cacti, respectively, and models the temperature using SESCSpot, which is based on HotSpot. IBM PowerTimer [Brooks et al. 2003] simulates the power for an entire processor using a simulator based on empirical techniques. PowerTimer uses the module power consumptions of an existing reference processor to predict those of the desired architectural model by scaling with an appropriate factor. The scaling factor for cache power prediction is dictated by a sophisticated power effect analysis based on the changes in cache parameters and organization. Cache energy consumption calculation based on simulation tools can be employed in both offline and online cache tuning. In offline cache tuning, the cache leakage power and dynamic energy per read/write for all cache configurations are determined and stored offline.
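To make the in-order-core model concrete, the sketch below combines the quantities listed above (leakage power, dynamic energy per access, access/miss counts, and stall cycles) into a single cache-related energy estimate. The field names and the miss-penalty terms are assumptions for illustration, and the per-configuration constants would in practice be produced offline by a tool such as Cacti; this is a sketch of the general modeling style, not the exact equations of any cited work.

```python
def cache_energy(cfg, accesses, misses, writebacks, exec_cycles, clock_hz):
    """Cache-related energy (joules) for one cache configuration on an in-order core.

    cfg holds the per-configuration constants that offline tuning would precompute:
      e_access    -- dynamic energy per cache read/write (J), e.g., from Cacti
      e_mem       -- dynamic energy per off-chip memory transfer (J)   [assumed]
      miss_cycles -- CPU stall cycles per cache miss                   [assumed]
      p_leak      -- cache leakage (static) power for this size (W)
      p_cpu_stall -- processor power while stalled on a miss (W)       [assumed]
    """
    dynamic  = accesses * cfg["e_access"]                 # cache reads/writes
    dynamic += (misses + writebacks) * cfg["e_mem"]       # block fills and write-backs
    stall_cycles = misses * cfg["miss_cycles"]
    static = cfg["p_leak"] * (exec_cycles + stall_cycles) / clock_hz
    stall  = cfg["p_cpu_stall"] * stall_cycles / clock_hz
    return dynamic + static + stall
```

During design space exploration, the same counters (accesses, misses, write-backs) would be fed through this model once per candidate configuration, and the configuration with the lowest estimate selected.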
During online cache tuning, the cache performance (accesses and misses) is monitored and preserved using hardware counters. When adapting the cache to minimize energy consumption at runtime, however, the hardware/software cache energy calculation may impede the system's normal execution. A direct and simple technique is to measure the energy consumption of the entire system during runtime using an on-chip temperature sensor [Sanchez et al. 1997], but only online cache tuning can leverage this technique.

5. CONCLUSIONS AND CHALLENGES
Cache tuning plays an important role in reducing cache power/energy consumption, thereby resulting in a large total microprocessor system power/energy reduction. We have surveyed various cache tuning techniques in two categories, offline static cache tuning and online dynamic cache tuning, ranging from hardware support to cache tuning heuristics, coupled with phase-change detection techniques. The tuning techniques adjust cache parameters and organization in accordance with changing cache behavior and/or system performance, which is obtained from different approaches, such as system execution, system simulation, memory access trace simulation/profiling, or application structure analysis. Given complex modern systems, very large design spaces, and diverse applications, different techniques have their own merits, limitations, and challenges for different circumstances. Therefore, it is difficult to identify a single, paramount cache tuning technique. This survey focuses on cache tuning to minimize energy consumption by varying four main cache parameters: total size, block size, associativity, and cache organization in multicore architectures. In addition to continually developing new architectures and techniques in this area, researchers have extended cache tuning techniques to other storage components, such as issue queues, register files, TLBs, and buffers. Other topics related to cache tuning include cache compression, shared cache bandwidth partitioning, cache interconnection network optimization, tuning the cache to trade off power/energy minimization against system performance optimization, and DVFS (dynamic voltage and frequency scaling) to manage system dynamic power. Additionally, emerging techniques in integrated circuit design and three-dimensional (3D) die-stacking have been proposed to solve the high area overhead problem of SRAM; however, a review of these related topics is beyond the scope of this survey. Even though cache tuning has been investigated extensively, many open challenges still remain, which we summarize as follows.
(1) Even though a large number of simulators have been developed and each simulator offers different coverage, accuracy, and simulation speed, a specially designed, accurate, and fast cache simulator for an arbitrary cache hierarchy is still needed.
(2) Trace-driven cache simulation is fast, especially using single-pass techniques to evaluate multiple configurations, thus providing efficient offline cache tuning. The main challenge is to capture the dynamic and timing-dependent behavior for simulations of out-of-order processors and multithreaded/multicore architectures.
(3) For cache tuning aimed at minimizing system power/energy consumption, calculating system power/energy consumption based on the cache misses acquired from offline simulation or online software/hardware profiling is challenging due to increasing interrelated factors and system complexity.
(4) Although a heuristic search of the design space can dramatically reduce the cache tuning time, heuristic techniques are difficult to design, especially for complex architectures. Since heuristics are based on empirical experience, which may depend on the particular cache hierarchy and organization, designing a versatile and generalized heuristic for arbitrary cache hierarchies and organizations is still an open research area.
(5) For multilevel and multicore cache tuning, decoupling the cache interference between levels and cores can reduce the cache tuning complexity.
(6) Online cache tuning is a prominent future direction, since online cache tuning can dynamically react to changing execution and does not require any design-time effort. Existing online cache tuning and phase-change detection techniques introduce hardware overhead and are intrusive to normal system execution. Reducing the impact of cache tuning and phase-change detection on performance and energy consumption is critical.
(7) Cache tuning considering data proximity, wire delay, and on-chip network traffic is necessary for both performance and energy optimizations and is an open topic.
(8) Cache tuning considering additional resource contentions in simultaneous multithreaded (SMT) processors has not been addressed completely.
(9) There is little research in tuning the entire cache subsystem, especially for multicore architectures.
(10) Currently, the major targets of cache partitioning are overall performance improvement and fairness guarantees. Combining cache partitioning and leakage power reduction techniques (dynamically deactivating portions of the cache using way/bank and block/set shutdown) to obtain both performance and energy optimizations is an interesting topic.
(11) Integrating operating system page coloring with other cache partitioning techniques, such as modifying the replacement and insertion policies and cooperative caching, can enhance the performance and energy optimization, which has not been thoroughly investigated.
(12) Precisely detecting a phase change is challenging, since online phase-change detection is reactive and proactive offline phase-change detection is generally bounded by the inputs used during offline analysis.
(13) Existing phase-change detection techniques for multithreaded/multicore architectures assume that the phases determined for a single-threaded application are valid when the application co-executes with other applications. The validation of this assumption for multicore applications with high intercore dependency and interaction is necessary.
(14) Emerging techniques developed in integrated circuit design for caches bring new challenges to cache tuning.

REFERENCES
ALBONESI, D. H. 1999. Selective cache ways: On-demand cache resource allocation. In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture. IEEE, Washington, DC, 248–259.
AMMONS, G., BALL, T., AND LARUS, J. R. 1997. Exploiting hardware performance counters with flow and context sensitive profiling. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.
ACM, New York, NY, 85–96. ANDERSON, J. M., BERC, L. M., DEAN, J., GHEMAWAT, S., HENZINGER M. R., LEUNG S. A., SITES, R. L., VANDEVOORDE, M. T., WALDSPURGER C. A., AND WEIHL W. E. 1997. Continuous profiling: Where have all the cycles gone? ACM Trans. Comput. Syst. 15, 4, 357–390. AUSTIN, T., LARSON, E., AND ERNST, D. 2002. SimpleScalar: An infrastructure for comput. system modeling. IEEE Comput. 35, 2, 59–67. AWASTHI, M., SUDAN, K., BALASUBRAMONIAN, R., AND CARTER, J. 2009. Dynamic hardware-assisted softwarecontrolled page placement to manage capacity allocation and sharing within large caches. In Proceedings of Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 250–261. BALASUBRAMONIAN, R., ALBONESI, D., BUYUKTOSUNOGLU, A. AND DWARKADAS, S. 2000. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, ACM, New York, NY, 245–257. BALASUBRAMONIAN, R., JOUPPI, N. P., AND MURALIMANOHAR, N. 2011. Multi-core cache hierarchies. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, San Rafael, CA. BECKMANN, B., MARTY, M., AND WOOD, D. 2006. ASR: Adaptive selective replication for CMP caches. In Proceedings of the 39th Annual IEEE/ ACM International Symposium on Microarchitecture (MICRO). IEEE, Los Alamitos, CA, 443–454. BEDICHEK, R. 2004. SimNow: Fast platform simulation purely in software. In Proceedings of the Symposium on High Performance Chips (HOT CHIPS).. BELLARD, F. 2005. QEMU, a fast and portable dynamic translator, USENIX’ 05 Technical Program. BENITEZ, D., MOURE, J. C., REXACHS, D. I., AND LUQUE E. 2006. Evaluation of the field-prorammable cache: Performance and energy consumption, In Proceedings of the 3rd Conference on Computing Frontiers, ACM, New York, NY, 361–372. BIESBROUCK, M. V., SHERWOOD, T., AND CALDER. B. 2004. A co-phase matrix to guide simultaneous multithreading simulation. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, Washington, DC, 45–56. BINKERT, N. L., DRESLINSKI, R. G., HSU, L. R., LIM, K. T., SAIDI, A. G., AND REINHARDT, S. K. 2006. The M5 simulator: Modeling networked systems. IEEE Micro. 26, 4, 52-60. BOHR, M. T., CHAU, R. S., GHANI, T., AND MISTRY, K. 2007. The high-k solution, IEEE Spectrum. BREHOB, M. AND ENBODY, R. J. 1996. An analytical model of locality and caching. Tech. rep. Michigan State University, East Lansing, MI. BROOKS, D. M., TIWARI, V., AND MARTONOSI, M. 2000. Wattch: A framework for architectural-level power analysis and optimizations. In Proceedings of 27th International Symposium on Computer Architecture. IEEE, Washington, DC, 83–94. BROOKS, D. M., BOSE, P., SRINIVASAN, V., GSCHWIND, M., EMMA, P., AND ROSENFIELD, M. 2003. New methodology for early-stage, microarchitecture-level power-performance analysis of microprocessors. IBM J. Res. Develop. 47, 5–6, 653–670. CHANDRA, D., GUO, F., KIM, S., AND SOLIHIN, Y. 2005. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture. IEEE, Washington, DC, 340–351. ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. A Survey on Cache Tuning from a Power/Energy Perspective 32:43 CHANG, J. AND SOHI, G. 2006. Co-operative caching for chip multiprocessors. 
In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA). IEEE, Washington, DC, 264–276. CHATTERJEE, S., PARKER, E., HANLON, P. J., AND LEBECK, A. R. 2001. Exact analysis of the cache behavior of nested loops. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM, New York, NY, 286–297. CHATTERJEE, B., SACHDEV, M., HSU, S., KRISHNAMURTHY, R., AND BORKAR, S. 2003. Effectiveness and scaling trends of leakage control techniques for sub-130 nm CMOS technologies. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED). IEEE, Washington, DC, 122–127. CHEN, C. F., YANG, S., FALSAFI, B., AND MOSHOVOS, A. 2004. Accurate and complexity-effective spatial pattern prediction. In Proceedings of the 10th International Symposium on High Performance Computer Architecture. IEEE, Washington, DC, 276. CHEN, J., DUBOIS, M., AND STENSTROM, P. 2007. SimWattch: Integrating complete-system and user-level performance and power simulators, IEEE Micro, 27, 4, 34–48. CHEN, X. E. AND AAMODT, T. M. 2009. A first-order fine-grained multithreaded throughput model. In Proceedings of the IEEE 15th International Symposium on High Performance Computer Architecture. IEEE, Washington, DC, 329–340. CHEN, J., ANNAVARAM, M., AND DUBOIS, M. 2009. SlackSim: A platform for parallel simulation of CMPs on CMPs. ACM SIGARCH Comput. Architect. News. 37, 2, 20–29. CHIDESTER, M. C. AND GEORGE, A. D. 2002. Parallel simulation of chip-multiprocessor architectures. ACM Trans. Model. Comput. Simul. (TOMACS) 12, 3, 176–200. CHIOU, D., CHIOUY, D., RUDOLPH, L., DEVADAS, S., AND ANG, B. S. 2000. Dynamic cache partitioning via columnization. Computation Structures Group Memo 430. Massachusetts Institute of Technology. CHO, S. AND JIN, L. 2006. Managing distributed, shared L2 caches through OS-Level page allocation. In Proceedings of the ACM/IEEE International Symposium on Microarchitectures (MICRO). IEEE, Washington, DC, 455–468 CMELIK, B. AND KEPPEL, D. 1994. SHADE: A fast instruction-set simulator for execution profiling. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. ACM, New York, NY, 128–137. CONTE, T. M., HIRSCH, M. A., AND HWU, W. W. 1998. Combining trace sampling with single pass methods for efficient cache simulation. IEEE Trans. Comput. 47, 6, 714–720. DEAN, J., HICKS, J. E., WALDSPURGER, C. A., WEIHL, W. E., AND CHRYSOS, G. 1997. ProfileMe: Hardware support for instruction-level profiling in out-of-order processors. In Proceedings of the 30th Anual ACM/IEEE International Symposium on Microarchitecture. IEEE, Washington, DC, 292–302. DHODAPKAR, A. S. AND SMITH, J. E. 2002. Managing multi-configuration hardware via dynamic working set analysis. In Proceedings of the 29th Annual International Symposium on Computer Architecture. IEEE, Washington, DC, 233–244. DHODAPKAR, A. S. AND SMITH, J. E. 2003. Comparing program phase detection techniques. In Proceedings of the International Symposium on Microarchitecture. IEEE, Washington, DC, 217. DÍAZ, J., HIDALGO, J. I., FERNÁNDEZ, F., GARNICA, O., AND LÓPEZ, S. 2009. Improving SMT performance: An application of genetic algorithms to configure resizable caches. In Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference. ACM, New York, NY, 2029–2034. DING, C. AND ZHONG, Y. 2003. Predicting whole-program locality through reuse distance analysis. 
In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, New York, NY, 245–257. DING, C. AND CHILIMBI, T. 2009. A composable model for analyzing locality of multi-threaded programs. Tech. rep. MSR-TR-2009-107, Microsoft. DROPSHO, S., BUYUKTOSUNOGLU, A., BALASUBRAMONIAN, R., ALBONESI, D. H., DWARKADAS, S., SEMERARO, G., MAGKLIS, G., AND SCOTT, M. L. 2002. Integrating adaptive on-chip storage structures for reduced dynamic power. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 141–152. DROPSHO, S., KURSUN, V., ALBONESI, D. H., DWARKADAS, S., AND FRIEDMAN, E. G. 2002. Managing static leakage energy in microprocessor functional units. In Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’35). IEEE, Los Alamitos, CA, 321–332. DUESTERWALD, E., CASCAVAL, C., AND DWARKADAS, S. 2003. Characterizing and predicting program behavior and its variability. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 220–231. DYBDAHL H. AND STENSTROM, P. 2007. An adaptive shared/private nuca cache partitioning scheme for chip multiprocessors. In Proceedings of the Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 2–12. ACM Computing Surveys, Vol. 45, No. 3, Article 32, Publication date: June 2013. 32:44 W. Zang and A. Gorden-Ross EDLER, J. AND HILL, M. D. 1998. Dinero IV trace-driven uniprocessor cache simulator. http://www.cs.wisc. edu/~markhill/DineroIV. EDMONDON, J., RUBINFELD, P. I., BANNON P. J., BENSCHNEIDER, B. J., BERNSTEIN, D., CASTELINO, R. W., COOPER, E. M., DEVER, D. E., DONCHIN, D. R., FISCHER, T. C., JAIN, A. K., MEHTA, S., MEYER, J. E., PRESTON, R. P., RAJAGOPALAN, V., SOMANATHAN, C., TAYLOR, S. A., AND WOLRICH, G. M. 1995. Internal organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC microprocessor. Digi. Tech. J. Special 10th Anniversary Issue, 7, 1, 119–135. EECKHOUT, L., NUSSBAUM, S., SMITH, J. E., AND BOSSCHERE, K. D. 2003. Statistical simulation: Adding efficiency to the computer designer’s toolbox. IEEE Micro. 23, 5, 26–38. EECKHOUT, L. 2010. Computer architecture performance evaluation methods. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, San Rafael, CA. EKLOV, D., BLACK-SCHAFFER, D., AND HAGERSTEN, E. 2011. Fast modeling of shared cache in multicore systems. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers. ACM New York, NY, 147–157. FALCÓN, A., FARABOSCHI, P., AND ORTEGA. D. 2008. An adaptive synchronization technique for parallel simulation of networked clusters. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 22–31. FANG, C., CARR, S., ONDER, S., AND WANG, Z. 2004. Reuse-distance-based miss-rate prediction on a per instruction basis. In Proceedings of the Workshop on Memory System Performance. ACM, New York, NY, 60–68. FLAUTNER, K., KIM, N. S., MATIN, S., BLAAUW, D., AND MUDGE, T. 2002. Drowsy caches: Simple techniques for reducing leakage power, In Proceedings of the 29th Annual International Symposium on Computer Architecture. ACM, New York, NY, 148–157. GENBRUGGE, D., EECKHOUT, L., AND BOSSCHERE K. D. 2006. Accurate memory data flow modeling in statistical simulation. In Proceedings of the 20th Annual International Conference of Supercomputing. 
GHOSH, S., MARTONOSI, M., AND MALIK, S. 1999. Cache miss equations: A compiler framework for analyzing and tuning memory behavior. ACM Trans. Program. Lang. Syst. 21, 4, 703–746.
GHOSH, A. AND GIVARGIS, T. 2004. Cache optimization for embedded processor cores: An analytical approach. ACM Trans. Design Autom. Electron. Syst. 9, 4, 419–440.
GLUHOVSKY, I. AND O'KRAFKA, B. 2005. Comprehensive multiprocessor cache miss rate generation using multivariate models. ACM Trans. Comput. Syst. 23, 2, 111–145.
GOLDSCHMIDT, S. AND HENNESSY, J. 1992. The accuracy of trace-driven simulations of multiprocessors. Tech. rep. CSL-TR-92-546, Stanford University.
GORDON-ROSS, A., VAHID, F., AND DUTT, N. 2004. Automatic tuning of two-level caches to embedded applications. In Proceedings of the Conference on Design, Automation and Test in Europe. IEEE, Washington, DC.
GORDON-ROSS, A. AND VAHID, F. 2007. A self-tuning configurable cache. In Proceedings of the 44th Annual Design Automation Conference. ACM, New York, NY, 234–237.
GORDON-ROSS, A., VIANA, P., VAHID, F., NAJJAR, W., AND BARROS, E. 2007. A one-shot configurable-cache tuner for improved energy and performance. In Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, San Jose, CA, 755–760.
GORDON-ROSS, A., LAU, J., AND CALDER, B. 2008. Phase-based cache reconfiguration for a highly-configurable two-level cache hierarchy. In Proceedings of the 18th ACM Great Lakes Symposium on VLSI. ACM, New York, NY, 379–382.
GORDON-ROSS, A., VAHID, F., AND DUTT, N. 2009. Fast configurable-cache tuning with a unified second-level cache. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 17, 1, 80–91.
HAMERLY, G., PERELMAN, E., LAU, J., AND CALDER, B. 2005. SimPoint 3.0: Faster and more flexible program analysis. J. Instruct.-Level Parall. 7, 1–28.
HANSON, H., HRISHIKESH, M. S., AGARWAL, V., KECKLER, S. W., AND BURGER, D. 2003. Static energy reduction techniques for microprocessor caches. IEEE Trans. Very Large Scale Integr. Syst. 11, 3, 303–313.
HARDAVELLAS, N., FERDMAN, M., FALSAFI, B., AND AILAMAKI, A. 2009. Reactive NUCA: Near-optimal block placement and replication in distributed caches. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA). ACM, New York, NY, 184–195.
HARPER, J. S., KERBYSON, D. J., AND NUDD, G. R. 1999. Analytical modeling of set-associative cache behavior. IEEE Trans. Comput. 48, 10, 1009–1024.
HEIDELBERGER, P. AND STONE, H. S. 1990. Parallel trace-driven cache simulation by time partitioning. In Proceedings of the 22nd Conference on Winter Simulation. IEEE, Piscataway, NJ, 734–737.
HILL, M. D. AND SMITH, A. J. 1989. Evaluating associativity in CPU caches. IEEE Trans. Comput. 38, 12, 1612–1630.
HIND, M., RAJAN, V., AND SWEENEY, P. 2003. Phase shift detection: A problem classification. Tech. rep., IBM.
HSU, L., REINHARDT, S., IYER, R., AND MAKINENI, S. 2006. Communist, utilitarian, and capitalist cache policies on CMPs: Caches as a shared resource. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT). ACM, New York, NY, 13–22.
HU, J. S., NADGIR, A., VIJAYKRISHNAN, N., IRWIN, M. J., AND KANDEMIR, M. 2003. Exploiting program hotspots and code sequentiality for instruction cache leakage management. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED). ACM, New York, NY, 402–407.
HUANG, M., RENAU, J., AND TORRELLAS, J. 2003. Positional adaptation of processors: Application to energy reduction. In Proceedings of the 30th Annual International Symposium on Computer Architecture. ACM, New York, NY, 157–168.
HUANG, W., GHOSH, S., VELUSAMY, S., SANKARANARAYANAN, K., SKADRON, K., AND STAN, M. R. 2006. HotSpot: A compact thermal modeling method for CMOS VLSI systems. IEEE Trans. Very Large Scale Integr. Syst. 14, 5, 501–513.
HUANG, C., SHELDON, D., AND VAHID, F. 2008. Dynamic tuning of configurable architectures: The AWW online algorithm. In Proceedings of the 6th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. ACM, New York, NY, 97–102.
HUH, J., KIM, C., SHAFI, H., ZHANG, L., BURGER, D., AND KECKLER, S. 2007. A NUCA substrate for flexible CMP cache sharing. IEEE Trans. Parallel Distrib. Syst. 18, 8, 1028–1040.
HUGHES, C. J., PAI, V. S., RANGANATHAN, P., AND ADVE, S. V. 2002. Rsim: Simulating shared-memory multiprocessors with ILP processors. IEEE Computer 35, 2, 40–49.
INOUE, K., MOSHNYAGA, V., AND MURAKAMI, K. 2001. Trends in high-performance, low-power cache memory architectures. IEICE Trans. Electronics 85, 314.
IYER, R. 2003. On modeling and analyzing cache hierarchies using CASPER. In Proceedings of the 11th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, Washington, DC, 182–187.
IYER, R. 2004. CQoS: A framework for enabling QoS in shared caches of CMP platforms. In Proceedings of the 18th Annual International Conference on Supercomputing. ACM, New York, NY, 257–266.
JALEEL, A., COHN, R. S., LUK, C. K., AND JACOB, B. 2008a. CMP$im: A Pin-based on-the-fly multi-core cache simulator. In Proceedings of the 4th Annual Workshop on Modeling, Benchmarking and Simulation.
JALEEL, A., HASENPLAUGH, W., QURESHI, M., SEBOT, J., STEELY, S., JR., AND EMER, J. 2008b. Adaptive insertion policies for managing shared caches. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, New York, NY, 208–219.
JANAPSATYA, A., IGNJATOVIĆ, A., AND PARAMESWARAN, S. 2006. Finding optimal L1 cache configuration for embedded systems. In Proceedings of the Asia and South Pacific Design Automation Conference. IEEE, Piscataway, NJ, 796–801.
JANAPSATYA, A., IGNJATOVIĆ, A., PARAMESWARAN, S., AND HENKEL, J. 2007. Instruction trace compression for rapid instruction cache simulation. In Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, San Jose, CA, 803–808.
JOSHI, A., YI, J. J., BELL, R. H., JR., EECKHOUT, L., JOHN, L., AND LILJA, D. 2006. Evaluating the efficacy of statistical simulation for design space exploration. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. 70–79.
KAXIRAS, S., HU, Z., AND MARTONOSI, M. 2001. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proceedings of the 28th International Symposium on Computer Architecture. IEEE, Washington, DC, 240–251.
KAXIRAS, S. AND MARTONOSI, M. 2008. Computer architecture techniques for power-efficiency. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, San Rafael, CA.
KESSLER, R. E. AND HILL, M. D. 1992. Page placement algorithms for large real-indexed caches. ACM Trans. Comput. Syst. 10, 4, 338–359.
KIHM, J. L. AND CONNORS, D. A. 2005. A mathematical model for accurately balancing co-phase effect in simulated multithreaded systems. In Proceedings of the Workshop on Modeling, Benchmarking and Simulation (held in conjunction with ISCA-32).
KIM, N. S., FLAUTNER, K., BLAAUW, D., AND MUDGE, T. 2002. Drowsy instruction caches: Leakage power reduction using dynamic voltage scaling. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-35). IEEE, Los Alamitos, CA, 219–230.
KIM, S., CHANDRA, D., AND SOLIHIN, Y. 2004. Fair cache sharing and partitioning in a chip multiprocessor architecture. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, Washington, DC, 111–122.
KIM, C. H., KIM, J., MUKHOPADHYAY, S., AND ROY, K. 2005. A forward body-biased low-leakage SRAM cache: Device, circuit and architecture considerations. IEEE Trans. Very Large Scale Integr. Syst. 13, 3, 349–357.
LAHA, S., PATEL, J. H., AND IYER, R. K. 1988. Accurate low-cost methods for performance evaluation of cache memory systems. IEEE Trans. Comput. 37, 11, 1325–1336.
LAU, J., SCHOENMACKERS, S., AND CALDER, B. 2004. Structures for phase classification. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, Piscataway, NJ, 57–67.
LAU, J., SCHOENMACKERS, S., AND CALDER, B. 2005. Transition phase classification and prediction. In Proceedings of the International Symposium on High-Performance Computer Architecture. IEEE, Washington, DC, 278–289.
LAU, J., PERELMAN, E., AND CALDER, B. 2006. Selecting software phase markers with code structure analysis. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE, Washington, DC, 135–146.
LEBECK, A. AND WOOD, D. 1994. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer 27, 10, 15–26.
LEE, H.-H. S., TYSON, G. S., AND FARRENS, M. K. 2000. Eager writeback: A technique for improving bandwidth utilization. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, New York, NY, 11–21.
LEE, K., EVANS, S., AND CHO, S. 2009. Accurately approximating superscalar processor performance from traces. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, Piscataway, NJ, 238–248.
LEE, H., JIN, L., LEE, K., DEMETRIADES, S., MOENG, M., AND CHO, S. 2010. Two-phase trace-driven simulation (TPTS): A fast multicore processor architecture simulation approach. Softw. Pract. Exper. 40, 3, 239–258.
LEE, K. AND CHO, S. 2011. In-N-Out: Reproducing out-of-order superscalar processor behavior from reduced in-order traces. In Proceedings of the International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, Washington, DC, 126–135.
LEE, H., CHO, S., AND CHILDERS, B. 2011. CloudCache: Expanding and shrinking private caches. In Proceedings of the Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 219–230.
LI, L., KADAYIF, I., TSAI, Y. F., VIJAYKRISHNAN, N., KANDEMIR, M., IRWIN, M. J., AND SIVASUBRAMANIAM, A. 2002. Leakage energy management in cache hierarchies. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 131–140.
LI, Y., PARIKH, D., ZHANG, Y., SANKARANARAYANAN, K., SKADRON, K., AND STAN, M. 2004. State-preserving vs. non-state-preserving leakage control in caches. In Proceedings of the Conference on Design, Automation and Test in Europe. IEEE, Washington, DC, 10.
LIN, J., LU, Q., DING, X., ZHANG, Z., ZHANG, X., AND SADAYAPPAN, P. 2009. Enabling software management for multicore caches with a lightweight hardware support. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, New York, NY.
LIU, C., SIVASUBRAMANIAM, A., AND KANDEMIR, M. 2004. Organizing the last line of defense before hitting the memory wall for CMPs. In Proceedings of the Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 176–185.
LUK, C.-K., COHN, R., MUTH, R., PATIL, H., KLAUSER, A., LOWNEY, G., WALLACE, S., REDDI, V. J., AND HAZELWOOD, K. 2005. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). ACM, New York, NY, 190–200.
MAGNUSSON, P. S., CHRISTENSSON, M., ESKILSON, K. J., FORSGREN, D., HALLBERG, G., HOGBERG, J., LARSSON, F., MOESTEDT, A., AND WERNER, B. 2002. Simics: A full system simulation platform. Computer 35, 2, 50–58.
MALIK, A., MOYER, B., AND CERMAK, D. 2000. A low power unified cache architecture providing power and performance flexibility. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM, New York, NY, 241–243.
MARTIN, M. M. K., SORIN, D. J., BECKMANN, B. M., MARTY, M. R., XU, M., ALAMELDEEN, A. R., MOORE, K. E., HILL, M. D., AND WOOD, D. A. 2005. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. ACM SIGARCH Comput. Architec. News 33, 4, 92–99.
MATTSON, R. L., GECSEI, J., SLUTZ, D. R., AND TRAIGER, I. L. 1970. Evaluation techniques for storage hierarchies. IBM Syst. J. 9, 2, 78–117.
MENG, Y., SHERWOOD, T., AND KASTNER, R. 2005. Exploring the limits of leakage power reduction in caches. ACM Trans. Architec. Code Optim. 2, 3, 221–246.
MIHOCKA, D. AND SCHWARTSMAN, S. 2008. Virtualization without direct execution or jitting: Designing a portable virtual machine infrastructure. In Proceedings of the Workshop on Architectural and Microarchitectural Support for Binary Translation (held in conjunction with ISCA).
MILLER, J. E., KASTURE, H., KURIAN, G., GRUENWALD, C., BECKMANN, N., CELIO, C., EASTEP, J., AND AGARWAL, A. 2010. Graphite: A distributed parallel simulator for multicores. In Proceedings of the IEEE 16th International Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 1–12.
MIPS R4000 microprocessor user's manual. 1994. http://groups.csail.mit.edu/cag/raw/documents/R4400_Uman_book_Ed2.pdf.
MIPS32 4K processor core family software user's manual. 2001. http://d3s.mff.cuni.cz/~ceres/sch/osy/download/MIPS32-4K-Manual.pdf.
MONTANARO, J., WITEK, R. T., ANNE, K., ET AL. 1997. A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor. Dig. Tech. J. 9, 1, 49–62.
NAMKUNG, J., KIM, D., GUPTA, R., KOZINTSEV, I., BOUGUET, J.-Y., AND DULONG, C. 2006. Phase guided sampling for efficient parallel application simulation. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS). ACM, New York, NY, 187–192.
ORTEGO, P. M. AND SACK, P. 2004. SESC: SuperESCalar Simulator. http://iacoma.cs.uiuc.edu/~paulsack/sescdoc/.
PERELMAN, E., POLITO, M., BOUGUET, J.-Y., SAMPSON, J., CALDER, B., AND DULONG, C. 2006. Detecting phases in parallel applications on shared memory architectures. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium. IEEE, Washington, DC, 88–98.
POWELL, M. D., YANG, S., FALSAFI, B., ROY, K., AND VIJAYKUMAR, T. N. 2000. Gated-Vdd: A circuit technique to reduce leakage in deep-submicron cache memories. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM, New York, NY, 90–95.
POWELL, M., YANG, S.-H., FALSAFI, B., ROY, K., AND VIJAYKUMAR, T. N. 2001. Reducing leakage in a high-performance deep-submicron instruction cache. IEEE Trans. Very Large Scale Integr. Syst. 9, 1, 77–89.
QURESHI, M. AND PATT, Y. 2006. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, Washington, DC, 423–432.
QURESHI, M. K., JALEEL, A., PATT, Y. N., STEELY, S. C., AND EMER, J. 2007. Adaptive insertion policies for high performance caching. In Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA). ACM, New York, NY, 381–391.
QURESHI, M. K. 2009. Adaptive spill-receive for robust high-performance caching in CMPs. In Proceedings of the 15th International Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 45–54.
RAJKUMAR, R., LEE, C., LEHOCZKY, J., AND SIEWIOREK, D. 1997. A resource allocation model for QoS management. In Proceedings of the 18th IEEE Real-Time Systems Symposium. IEEE, Washington, DC, 298.
RAMASWAMY, S. AND YALAMANCHILI, S. 2007. Improving cache efficiency via resizing + remapping. In Proceedings of the 25th International Conference on Computer Design. IEEE, Washington, DC, 47–54.
RANGANATHAN, P., ADVE, S., AND JOUPPI, N. P. 2000. Reconfigurable caches and their application to media processing. In Proceedings of the 27th Annual International Symposium on Computer Architecture. ACM, New York, NY, 214–224.
RAWLINS, M. AND GORDON-ROSS, A. 2011. CPACT – The conditional parameter adjustment cache tuner for dual-core architectures. In Proceedings of the IEEE International Conference on Computer Design (ICCD). IEEE, Los Alamitos, CA.
RAWLINS, M. AND GORDON-ROSS, A. 2012. An application classification guided cache tuning heuristic for multi-core architectures. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, Piscataway, NJ.
RENAU, J., FRAGUELA, B., TUCK, J., LIU, W., PRVULOVIC, M., CEZE, L., STRAUSS, K., SARANGI, S., SACK, P., AND MONTESINOS, P. 2005. SESC simulator. http://sesc.sourceforge.net.
RICO, A., DURAN, A., CABARCAS, F., ETSION, Y., RAMIREZ, A., AND VALERO, M. 2011. Trace-driven simulation of multithreaded applications. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, Piscataway, NJ, 87–96.
ROSENBLUM, M., BUGNION, E., DEVINE, S., AND HERROD, S. A. 1997. Using the SimOS machine simulator to study complex computer systems. ACM Trans. Model. Comput. Simul. 7, 1, 78–103.
SANCHEZ, H., KUTTANNA, B., OLSON, T., ALEXANDER, M., GEROSA, G., PHILIP, R., AND ALVAREZ, J. 1997. Thermal management system for high performance PowerPC microprocessors. In Proceedings of the 42nd IEEE International Computer Conference. IEEE, Washington, DC, 325–330.
SEGARS, S. 2001. Low power design techniques for microprocessors. In Proceedings of the International Solid State Circuit Conference.
SHEN, X., ZHONG, Y., AND DING, C. 2004. Locality phase prediction. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, 165–176.
SHEN, X., ZHONG, Y., AND DING, C. 2005. Phase-based miss rate prediction across program inputs. In Proceedings of the 17th International Workshop on Languages and Compilers for High Performance Computing. Springer, Berlin, Heidelberg, Germany, 42–55.
SHERWOOD, T., PERELMAN, E., AND CALDER, B. 2001. Basic block distribution analysis to find periodic behavior and simulation points in applications. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 3–14.
SHERWOOD, T., SAIR, S., AND CALDER, B. 2003. Phase tracking and prediction. In Proceedings of the 30th Annual International Symposium on Computer Architecture. ACM, New York, NY, 336–349.
SHERWOOD, T., PERELMAN, E., HAMERLY, G., SAIR, S., AND CALDER, B. 2003. Discovering and exploiting program phases. IEEE Micro 23, 6, 84–93.
SHI, X., SU, F., PEIR, J., XIA, Y., AND YANG, Z. 2009. Modeling and stack simulation of CMP cache capacity and accessibility. IEEE Trans. Parallel Distrib. Syst. 20, 12, 1752–1763.
SHIUE, W. AND CHAKRABARTI, C. 2001. Memory design and exploration for low power, embedded systems. J. VLSI Signal Process. Syst. 29, 3, 167–178.
SRIKANTAIAH, S., KANDEMIR, M., AND IRWIN, M. 2008. Adaptive set pinning: Managing shared caches in chip multiprocessors. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, New York, NY, 135–144.
SRIKANTAIAH, S., KULTURSAY, E., ZHANG, T., KANDEMIR, M., IRWIN, M., AND XIE, Y. 2011. MorphCache: A reconfigurable adaptive multi-level cache hierarchy for CMPs. In Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 231–242.
SRIVASTAVA, A. AND EUSTACE, A. 1994. ATOM: A system for building customized program analysis tools. Tech. rep. 94/2, Western Research Lab, Compaq.
SUH, G. E., RUDOLPH, L., AND DEVADAS, S. 2004. Dynamic partitioning of shared cache memory. J. Supercomput. 28, 1, 7–26.
SUGUMAR, R. AND ABRAHAM, S. 1991. Efficient simulation of multiple cache configurations using binomial trees. Tech. rep. CSE-TR-111-91.
SUGUMAR, R. A. 1993. Multi-configuration simulation algorithms for the evaluation of computer architecture designs. Ph.D. thesis, University of Michigan, Ann Arbor, MI.
TARJAN, D., THOZIYOOR, S., AND JOUPPI, N. P. 2006. CACTI 4.0. Tech. rep. HPL-2006-86, Hewlett-Packard Laboratories.
THOMPSON, J. G. AND SMITH, A. J. 1989. Efficient (stack) algorithms for analysis of write-back and sector memories. ACM Trans. Comput. Syst. 7, 1, 78–117.
ISHIHARA, T. AND FALLAH, F. 2005. A non-uniform cache architecture for low power system design. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM, New York, NY, 363–368.
UHLIG, R. A. AND MUDGE, T. N. 1997. Trace-driven memory simulation: A survey. ACM Comput. Surv. 29, 2, 128–170.
VARADARAJAN, K., NANDY, S., SHARDA, V., BHARADWAJ, A., IYER, R., MAKINENI, S., AND NEWELL, D. 2006. Molecular caches: A caching structure for dynamic creation of application-specific heterogeneous cache regions. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, Los Alamitos, CA, 433–442.
VEIDENBAUM, A., TANG, W., GUPTA, R., NICOLAU, A., AND JI, X. 1999. Adapting cache line size to application behavior. In Proceedings of the International Conference on Supercomputing. ACM, New York, NY, 145–154.
VENKATACHALAM, V. AND FRANZ, M. 2005. Power reduction techniques for microprocessor systems. ACM Comput. Surv. 37, 3, 195–237.
VERA, X., BERMUDO, N., LLOSA, J., AND GONZALEZ, A. 2004. A fast and accurate framework to analyze and optimize cache memory behavior. ACM Trans. Program. Lang. Syst. 26, 2, 263–300.
VIANA, P., GORDON-ROSS, A., KEOGH, E., BARROS, E., AND VAHID, F. 2006. Configurable cache subsetting for fast cache tuning. In Proceedings of the ACM Design Automation Conference. ACM, New York, NY, 695–700.
VIANA, P., GORDON-ROSS, A., BARROS, E., AND VAHID, F. 2008. A table-based method for single-pass cache optimization. In Proceedings of the 18th ACM Great Lakes Symposium on VLSI. ACM, New York, NY, 71–76.
VIVEKANANDARAJAH, K., SRIKANTHAN, T., AND CLARKE, C. T. 2006. Profile directed instruction cache tuning for embedded systems. In Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures. IEEE, Washington, DC, 227.
WAN, H., GAO, X., LONG, X., AND WANG, Z. 2009. GCSim: A GPU-based trace-driven simulator for multi-level cache. Adv. Parallel Process. Technol., 177–190.
WENISCH, T. F., WUNDERLICH, R. E., FERDMAN, M., AILAMAKI, A., FALSAFI, B., AND HOE, J. C. 2006. SimFlex: Statistical sampling of computer system simulation. IEEE Micro 26, 4, 18–31.
WITCHEL, E. AND ROSENBLUM, M. 1996. Embra: Fast and flexible machine simulation. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM, New York, NY, 68–79.
WUNDERLICH, R. E., WENISCH, T. F., FALSAFI, B., AND HOE, J. C. 2003. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In Proceedings of the Annual International Symposium on Computer Architecture (ISCA). IEEE, Washington, DC, 84–95.
XIANG, X., BAO, B., BAI, T., DING, C., AND CHILIMBI, T. 2011. All-window profiling and composable models of cache sharing. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming. ACM, New York, NY, 91–102.
XIE, Y. AND LOH, G. 2009. PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches. ACM SIGARCH Comput. Architec. News 37, 3, 174–183.
XU, C., CHEN, X., DICK, R. P., AND MAO, Z. M. 2010. Cache contention and application performance prediction for multi-core systems. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS). IEEE, Piscataway, NJ, 76–86.
YEH, T. AND REINMAN, G. 2005. Fast and fair: Data-stream quality of service. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES). ACM, New York, NY, 237–248.
YOURST, M. T. 2007. PTLsim: A cycle accurate full system x86-64 microarchitectural simulator. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, Piscataway, NJ, 23–34.
ZANG, W. AND GORDON-ROSS, A. 2011. T-SPaCS: A two-level single-pass cache simulation methodology. In Proceedings of the 16th Asia and South Pacific Design Automation Conference. IEEE, Piscataway, NJ, 419–424.
ZHANG, W., HU, J. S., DEGALAHAL, V., KANDEMIR, M., VIJAYKRISHNAN, N., AND IRWIN, M. J. 2002. Compiler-directed instruction cache leakage optimization. In Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35). IEEE, Los Alamitos, CA, 208–218.
ZHANG, Y., PARIKH, D., SANKARANARAYANAN, K., SKADRON, K., AND STAN, M. 2003. HotLeakage: A temperature-aware model of subthreshold and gate leakage for architects. Tech. rep. CS-2003-05, Department of Computer Science, University of Virginia, Charlottesville, VA.
ZHANG, C., VAHID, F., AND LYSECKY, R. 2004. A self-tuning cache architecture for embedded systems. ACM Trans. Embed. Comput. Syst. (Special Issue on Dynamically Adaptable Embedded Systems) 3, 2, 1–19.
ZHOU, H., TOBUREN, M. C., ROTENBERG, E., AND CONTE, T. 2003. Adaptive mode control: A static-power-efficient cache design. ACM Trans. Embed. Comput. Syst. 2, 3, 347–372.
ZHONG, Y., DROPSHO, S., AND DING, C. 2003. Miss rate prediction across all program inputs. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 91–101.
Received May 2011; revised November 2011; accepted February 2012