VLSI Architecture

CONTENTS
I    Introduction
II   Related Work
         Dynamically Resizable Instruction Cache
         Cache Decay
         Partitioned Cache Architecture
         Selective Cache Ways
III  Time Based Leakage Control in Partitioned Cache Architecture
         Overview
         Block Diagram
         Implementation
         Placement Strategies
         Prediction Strategies
         Deciding Cache Decay Interval
IV   Conclusion
     References

SECTION I: INTRODUCTION

The advance in semiconductor technology has paved the way for increasing the density of transistors per chip. The amount of information storable on a given amount of silicon has roughly doubled every year since the technology was invented. Thus the performance of processors has improved, and the energy dissipation of chips has increased, with each processor generation. This created awareness of the need to design low-power circuits. Low power is important in portable devices because the weight and size of the device are determined by the amount of battery needed, which in turn depends on the amount of power dissipated in the circuit. The cost involved in providing power and the associated cooling, reliability issues, and expensive packaging have made low power a concern in non-portable applications like desktops and servers too. Even though most power dissipation in CMOS CPUs is dynamic power dissipation (a function of the operating frequency of the device and the switching capacitance), leakage power (a function of the number of on-chip transistors) is also becoming increasingly significant, since leakage current flows in every transistor that is powered on, irrespective of signal transitions.

Most of the leakage energy comes from memories: since the cache occupies much of a CPU chip's area and contains a large number of transistors, reducing leakage in the cache results in a significant reduction in the overall leakage energy of the processor. This paper suggests an architectural approach for reducing leakage energy in caches.

Various approaches have been suggested, at both the architecture and circuit levels, to reduce leakage energy. One approach is to count the total number of misses in a cache and upsize or downsize the cache depending on whether the miss count is greater or less than a preset value; the cache dynamically resizes to the application's required size, and the unused sections of the cache are shut off. Another method, called cache decay, turns off cache lines when they hold data that is not likely to be reused. The cache lines are shut off during their dead time, that is, during the time after the last access and before the eviction: if a specified number of cycles has elapsed and the data is still unused, that cache line is shut off. Yet another approach is to disable a portion of the cache ways, called selective cache ways. This method, which is application-sensitive, enables all the cache ways (a way is one of the n sections in an n-way set-associative cache) when high performance is required and enables only a subset of the ways when cache demands are not high.

This paper is organized as follows: Section II narrates the work done related to this problem, Section III describes our approach, which uses a time-based decay policy in a partitioned architecture of the level 2 cache, and Section IV presents the conclusion.

SECTION II: RELATED WORK
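The miss-count-driven resizing described above can be sketched as a small decision routine (a minimal illustration only; the threshold and the size bounds are hypothetical values, not taken from the original design):

```python
# Sketch of miss-count-driven cache resizing: every sampling interval
# (say, every 1 million instructions), the miss count is compared with
# a preset value and the active cache is grown or shrunk, never below
# a minimum size that prevents thrashing. All parameters illustrative.

MISS_PRESET = 5_000       # preset miss-count threshold (hypothetical)
MIN_SIZE_KB = 8           # floor that prevents thrashing (hypothetical)
MAX_SIZE_KB = 64          # full cache size

def resize(current_kb: int, misses_in_interval: int) -> int:
    """Return the new active cache size after one sampling interval."""
    if misses_in_interval > MISS_PRESET:
        # Too many misses: upsize (enable more of the cache).
        return min(current_kb * 2, MAX_SIZE_KB)
    # Few misses: downsize; gated-Vdd shuts off the unused sections.
    return max(current_kb // 2, MIN_SIZE_KB)
```

The clamp at `MIN_SIZE_KB` models the anti-thrashing floor mentioned above; the clamp at `MAX_SIZE_KB` models the physical cache size.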
Dynamically resizable instruction cache:

This method exploits the utilization of the cache; cache utilization varies depending on the application's requirements. By shutting off the portion of the cache that is unused, leakage energy can be reduced significantly. It uses a dynamically resizable I-cache architecture, which resizes in accordance with the application's requirements, and a circuit-level technique called gated-Vdd to turn off the unused portions of the cache. The number of misses is counted periodically (say, every 1 million instructions), and the cache size is increased or decreased depending on whether the count is more or less than a preset value. The cache is also prevented from thrashing by fixing a minimum size below which it cannot be decreased.

Merits:
- Reduces the average size of a 64 K cache by 62%, thus lowering leakage energy, while the performance degradation stays within 4%.
- By employing a wide NMOS dual-Vt gated-Vdd implementation (connecting the gated-Vdd transistor in series with the SRAM cell exploits the stacking effect), leakage is virtually eliminated with only a 5% area increase.
- By controlling the miss rate with reference to a preset value, both the performance degradation and the increase in the lower cache levels' energy dissipation (due to misses in the L1 cache) are kept low.
- The dynamic energy of the counter hardware used is small, as the average number of bits switching on a counter increment is less than two (the i-th bit in a counter switches only once every 2^i increments).

Demerits:
- Resizing affects the miss rate; a miss in the L1 cache leads to dynamic energy dissipation in the L2 cache, so the number of accesses to the L2 cache should be kept low.
- There is extra L1 dynamic energy due to the resizing tag bits.
- The resizing circuitry may increase energy dissipation, offsetting the gains from cache resizing, so the resizing frequency should be low. A longer resizing interval will span multiple application phases, reducing the opportunity for resizing, while a shorter resizing interval may increase the overhead.
- Resizing from one size to another modifies the set-mapping function for blocks and may result in an incorrect lookup.
- For an application that requires a small I-cache, the dynamic component will be large due to the large number of resizing tag bits.
- The gated-Vdd transistor must be large to sink the current flowing through the SRAM cells during read/write operations, but too large a gated-Vdd transistor reduces the stacking effect and increases the area overhead.

Cache decay:

In this technique, cache lines which hold data that is not likely to be reused are turned off. It exploits the fact that a cache line is used frequently when data is first brought in, after which there is a period of dead time before the data is evicted. By turning off cache lines during their dead time, leakage energy can be reduced significantly without incurring additional misses, so performance remains comparable to a conventional cache. The policy used here is a time-based policy that turns a cache line off if a preset number of cycles has elapsed since its last access. As seen in Fig. A, the access interval is the time between two hits, and the dead time is the time between the last hit and the time at which the data is evicted.

[Fig. A: Access intervals and dead time in the lifetime of a cache line.]

As seen in Fig. B, the dead time for most of the benchmarks is high.

[Fig. B: Dead-time fractions across benchmarks.]

Merits:
- A 70% reduction in L1 data cache leakage energy is achieved.
- Program performance and dynamic power dissipation are not affected much, as a cache line is turned off only during its dead time.
- Results show that dead times are long, and thus moderately easy to identify.
- Very successful if the application has poor reuse of data, as in streaming applications.
- Can be applied to the outer levels of the cache hierarchy, as outer levels are likely to have longer generations with larger dead-time intervals.

Demerits:
- There might be additional L1 misses due to early shut-off of a cache line.
- Shorter decay intervals (the time after which a cache line is shut off) reduce leakage energy but may increase the miss rate, leading to dynamic energy dissipation in lower-level memory.

Partitioned cache architecture:

This method partitions the cache into smaller units (subcaches), each of which acts as a cache, and selectively disables the unused components. Since the partitioning is at the architecture level, the data placement and data probing mechanisms are more sophisticated than those at the circuit level. The subcaches can all be the same, or they can be caches with different topologies. The cache predictor tells the cache controller which of the subcaches should be activated.

Merits:
- Reduces per-access energy costs.
- Improves locality behavior.
- Smaller and less energy-consuming components.
- Both performance and energy can be optimized.
- Breaking the cache up into subcaches or subbanks reduces the wiring and diffusion capacitances of the bit lines and the wiring and gate capacitances of the word lines; thus the dynamic energy consumed when accessing the cache is lower.
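The time-based decay policy described above can be illustrated with a small behavioral model (a sketch only; the decay interval and the access times are made-up values, not measurements from the paper):

```python
# Sketch of time-based cache decay: a line is shut off once a preset
# number of cycles has elapsed since its last access, i.e. once it is
# presumed to be in its dead time. DECAY_INTERVAL is illustrative.

DECAY_INTERVAL = 10_000   # idle cycles before shut-off (hypothetical)

class DecayLine:
    def __init__(self):
        self.last_access = 0
        self.powered = True

    def access(self, cycle: int):
        # Any hit resets the decay timer and (re)powers the line.
        self.last_access = cycle
        self.powered = True

    def tick(self, cycle: int):
        # Called periodically; shut the line off if it has been idle
        # longer than the decay interval.
        if cycle - self.last_access >= DECAY_INTERVAL:
            self.powered = False

line = DecayLine()
line.access(0)
line.tick(5_000)      # still within the interval: stays powered
assert line.powered
line.tick(12_000)     # idle past the interval: line is shut off
assert not line.powered
```

A too-small `DECAY_INTERVAL` here would model exactly the demerit noted above: lines shut off early, causing extra misses and dynamic energy in the lower-level memory.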
Demerits:
- If a large number of cycles is spent servicing a memory request because of a poor probing strategy, performance will be degraded.
- Performance depends on the effectiveness of the probing policy; if the probing policy is not good, there will be a reprobing penalty.
- Energy depends on the number of subcaches accessed per data reference, and thus on the energy consumed per access.

Selective cache ways:

This method exploits the subarray partitioning that is usually already present, enabling all the cache ways when required to achieve high performance but only a subset of the cache ways when cache demands are not high. Since only a subset of the cache ways is active, leakage energy can be reduced significantly. The strategy exploits the fact that cache requirements vary considerably between applications as well as within an application. A software-visible register called the cache way select register (CWSR) signals the hardware to enable or disable particular ways, and special instructions exist for writing and reading it. Software also plays a role in analyzing the application's cache requirements, enabling cache ways, and saving the cache way select register; thus this is a combination of hardware and software. The degree to which ways are disabled depends on the relative energy dissipation of the different memory hierarchy levels and how they are affected by disabling the ways.

SECTION III: TIME BASED LEAKAGE CONTROL IN PARTITIONED CACHE ARCHITECTURE

Overview:

The level 2 cache is larger in size than the level 1 cache, and thus dissipates more leakage energy. By reducing the leakage energy in the L2 cache, the overall leakage energy can be reduced to a great extent. This paper combines two existing strategies to reduce leakage power in the level 2 cache, exploiting the advantages of partitioning and of time-based cache decay. The level 2 cache is partitioned into smaller units, each of which is a cache by itself, called a subcache. Methods have been proposed for partitioning the cache structure, for shutting off part of the cache ways during their dead time, and for partitioning the subarrays of a cache structure. In this paper we suggest partitioning the cache structure into small caches called subcaches and implementing cache decay (shutting off portions of the cache ways) in each subcache. This can reduce the leakage energy significantly.

The subcache architecture enjoys the following benefits:
- Reduces per-access energy costs.
- Improves locality behavior.
- Smaller and less energy-consuming components.
- Both performance and energy can be optimized.
- Breaking the cache up into subcaches or subbanks reduces the wiring and diffusion capacitances of the bit lines and the wiring and gate capacitances of the word lines; thus the dynamic energy consumed when accessing the cache is lower.

Time-based cache decay is an appropriate technique to use within a subbank for the following reasons:
- Program performance will not be affected much, as a cache line is turned off only during its dead time.
- Time-based cache decay works well when the reuse of data is poor; the reuse of data in the L2 cache is less than in the L1 cache, so it is appropriate to apply this technique to the L2 cache.
- Outer levels of the hierarchy are likely to have longer generations with larger dead-time intervals, which is exactly what this time-based decay technique requires.
- The fraction of time a cache way is dead increases with a higher miss rate, as the lines spend more of their time about to be evicted.

This architecture selectively disables the unused subcaches and activates the one holding the data, so leakage energy can be reduced significantly. By applying the time-based cache decay technique within each subcache, only part of the cache ways is enabled inside a subcache, and the power wasted on dead time (when a cache way is idle) is avoided. Thus this combination of partitioning and selective shut-off of cache ways can reduce leakage energy more than when either technique is applied alone.

Block diagram:

[Fig. C: Block diagram of the partitioned cache architecture. The virtual address is looked up in the TLB while, concurrently, the cache predictor (backed by default-predictor and reprobe logic) supplies a subcache id, or a no-prediction, to the cache controller, which activates one of the subcaches 1..N; the cache miss and placement logic handles misses and provides feedback to the predictor.]

[Fig. D: Architecture of a subcache. Each cache line (data + tag) has a valid bit and a local 2-bit saturating Gray-code counter with inputs WRD (line access) and T (tick). A global counter supplies a cascaded tick pulse; when a local counter saturates, its power-off output turns off the gated-Vdd transistor of that line, so only the counter and valid bit remain always powered. The state diagram of the 2-bit counter is also shown.]

Implementation:

The block diagram of the hardware implementation is shown in Fig. C. The level 2 cache is divided into smaller units, each acting like a cache by itself; these are called SUBBANKS or SUBCACHES. The subcache that needs to be activated is decided by a logic block called the CACHE PREDICTOR. This operation is performed concurrently with the TLB lookup in order to avoid delay in the critical path. The output of the cache predictor is either a subcache id or a no-prediction. If the output is a no-prediction, a logic block called the DEFAULT PREDICTOR is used to select the subcache for activation. Based on the predictor output, the CACHE CONTROLLER activates the appropriate subcache.
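The prediction path just described, a cache predictor that falls back to a default predictor on a no-prediction, can be sketched as follows (names and the hash-based fallback are illustrative assumptions, not the paper's exact logic):

```python
# Sketch of subcache selection: the cache predictor runs concurrently
# with the TLB lookup; on a no-prediction (None here), a default
# predictor chooses the subcache instead. Fallback rule is hypothetical.

def cache_predictor(block_addr, history):
    # Predict from recently observed placements; None models the
    # "no prediction" output described in the text.
    return history.get(block_addr)

def default_predictor(block_addr, num_subcaches):
    # Illustrative fallback: map the address to a subcache by modulo.
    return block_addr % num_subcaches

def select_subcache(block_addr, history, num_subcaches=4):
    predicted = cache_predictor(block_addr, history)
    if predicted is None:
        predicted = default_predictor(block_addr, num_subcaches)
    return predicted

history = {0x40: 2}                       # block 0x40 last seen in subcache 2
assert select_subcache(0x40, history) == 2   # predictor supplies an id
assert select_subcache(0x41, history) == 1   # no prediction: fallback used
```

In hardware both predictors would be table lookups evaluated in parallel with address translation; the sequential fallback here is only a behavioral simplification.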
The check is made only against the cache ways that are active within the subcache, not all of the cache ways within the subcache. Disabling the cache ways within a subcache is done by means of a time-based decay policy (Fig. D).

If the cache controller cannot find the data in the selected subcache, the CACHE MISS logic informs the RE-PROBE logic, which determines the next subcache to probe. The re-probe logic remains active until the data is found. When the data cannot be found in any of the subcaches, the cache miss and placement logic becomes active and brings the block from main memory. The information in the cache predictor and the re-probe logic is then updated, as one of the blocks needs to be evicted. The cache predictor is also updated whenever there is a cache hit that it did not predict.

The time-based decay policy is implemented in each subcache. Each cache line within a subcache is connected to a local 2-bit counter, which increments its value on receiving a tick pulse from a global counter. The two inputs the local counter receives are the global tick signal T and the cache-line access signal WRD. The global counter can be common to all subcaches. The tick signal is cascaded from one local counter to the next with one clock cycle of latency, so that writebacks cannot all take place at the same time. Each cache line implements the state machine shown in Fig. D. When the 2-bit counter reaches its maximum value, the decay interval (the time allowed before the line is shut off) has elapsed; for L2 caches, the decay interval is found to be in the range of tens of thousands of cycles. On every access to the cache line, the 2-bit counter is reset to its initial value. Once the counter saturates at its maximum value, the cache line is shut off using the gated-Vdd technique: the output of the local counter is a power-off signal that goes to the gated-Vdd transistor connected in series with the SRAM cell, and the transistor is turned off when the signal is asserted, disabling that cache way. Thus the cache ways that are idle are disabled.

The subcache architecture adopts certain policies for placing data when a miss is encountered, for predicting the subcache in which the data will be present (probing), and for reprobing subcaches in case the first probe fails. These are described below.

Placement strategies:

This determines how data from memory is placed in the subcache system. Selecting a good placement policy has both energy and performance implications, and it depends on the amount of past history maintained by the system. Once the subcache is selected, the data is placed inside the subcache according to its own topology. The different policies are random, least-recently-used (LRU), spatial-temporal (ST), and modified-spatial-temporal (MST). In the random policy, one of the subcaches is selected at random. In LRU, the subcache that was least recently used is selected for placement. Usually, spatial data and temporal data are stored in separate subcaches; if bypass data comes in, it is stored in either the spatial or the temporal subcache instead of being given a fixed location. This strategy is called MST, and performance is found to be better this way, as the number of misses is reduced. The bypass data is stored in whichever subcache (spatial or temporal) has the smaller number of misses, for better load balance. MST improves performance, as the number of misses is lower compared with the other strategies.

Prediction strategies:

This is the strategy used to probe for the data in the subcaches. It needs to be good; otherwise, there is a penalty when a miss occurs, as reprobing has to be done. The "All" strategy accesses all subcaches concurrently, and thus provides no energy savings during the probing stage. The MRU/WP strategy accesses the most recently used subcache first; the most-recently-used information can be maintained in a single register. The CIB (Cache Identifier Buffer) strategy holds a list of the most recently used virtual addresses and the corresponding physical subcaches holding those blocks. Whenever the CIB is not able to make a prediction, a default predictor predicts the subcache. The CIB entries are updated whenever the corresponding cache line is accessed or evicted by the cache miss and placement logic. To reduce the reprobe penalty when the first probe misses, it is good to probe all subcaches other than the one already accessed simultaneously. CIB is an effective probing strategy: when the program exhibits good locality, the CIB has a high probability of making a prediction.
Deciding cache decay interval:

The decay interval, which determines the time after which each cache way within a subcache is shut off, has to be chosen properly in order to avoid extra misses in the cache. Two things can be done: either the cache line is turned off at some reference point, on the judgment that the line is worth turning off, or the line is watched over a period of time and turned off only if no further access occurs. One policy is to turn the line off at the point in time where the extra cost incurred by waiting is precisely equal to the extra cost that would be incurred if the action turns out to be wrong. If we wait longer, more leakage energy is dissipated; on the other hand, if the decay interval is too short, the number of L2 misses increases, which cannot be tolerated, especially if the miss penalty comes from off-chip. One good time to turn a line off is therefore when the static energy dissipated since the last access is precisely equal to the dynamic energy that would be dissipated if turning the line off induced an extra miss. The decay interval for the L1 cache was found to be about 10,000 cycles; since a higher-level cache tends to have longer generations with large dead-time intervals, the decay interval for the L2 cache will be greater.

SECTION IV: CONCLUSION

Thus, by partitioning the cache into smaller units and shutting off a portion of the cache ways in each subcache using a time-based decay policy, the leakage energy can be reduced more than when a single technique is implemented, without much performance degradation. Partitioning the cache into smaller units offers many benefits, such as reducing per-access energy, improving locality behavior, and lowering dynamic energy consumption. The subcache architecture uses efficient placement and prediction schemes: the placement scheme is MST (modified spatial temporal), which reduces the miss rate by placing data in the appropriate (spatial or temporal) subcache, and the prediction scheme is CIB (cache identifier buffer), which predicts well, especially when the program exhibits good locality. Since the time-based decay policy turns a cache line off only during its dead time, performance is not affected much, as the number of additional misses in the cache is small. The time-based decay policy exploits the generational characteristics of cache-line usage: individual cache lines are turned off during their dead period (the time between the last successful access and the line's eviction). A global counter provides a tick pulse for the local counters of all cache lines, and a cache line is turned off when its local counter reaches its maximum value. Compared to standard caches of various sizes, a decay cache offers a better active size for the same miss rate, or a better miss rate for the same active size. The decay interval can also be varied according to the utilization of each cache line, saving more energy: this adaptive scheme involves selecting among a multitude of decay intervals per cache line, and by varying the decay interval of each line depending on its utilization, additional leakage energy can be saved. The time-based cache decay technique can also be applied to level 1 caches.
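The break-even rule from the decay-interval discussion above (turn a line off when the static energy leaked since its last access equals the dynamic energy of one induced extra miss) reduces to a one-line calculation. All numbers below are purely illustrative, not measurements from the paper:

```python
# Break-even decay interval: waiting leaks leak_per_cycle per cycle,
# while a wrong shut-off costs one extra miss. The break-even point is
# where leak_per_cycle * interval == miss_energy. Values illustrative.

def break_even_interval(leak_per_cycle_pj: float, miss_energy_pj: float) -> float:
    """Cycles after which waiting has leaked as much as one extra miss costs."""
    return miss_energy_pj / leak_per_cycle_pj

# e.g. 0.25 pJ/cycle of line leakage vs a 2500 pJ extra-miss cost
assert break_even_interval(0.25, 2500.0) == 10_000.0
```

With a larger miss penalty (as for off-chip L2 misses) the break-even interval grows, which is consistent with the text's observation that the L2 decay interval should exceed the roughly 10,000-cycle interval found for L1.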
REFERENCES:

1. Michael Powell, Se-Hyun Yang, Babak Falsafi, Kaushik Roy, and T. N. Vijaykumar, "Reducing Leakage in a High-Performance Deep-Submicron Instruction Cache," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 9, No. 1, February 2001.
2. Stefanos Kaxiras, Zhigang Hu, and Margaret Martonosi, "Cache Decay: Exploiting Generational Behaviour to Reduce Cache Leakage Power."
3. David H. Albonesi, "Selective Cache Ways: On-Demand Cache Resource Allocation," Journal of Instruction-Level Parallelism 2 (2000) 1-6, May 2000.
4. S. Kim, N. Vijaykrishnan, M. Kandemir, A. Sivasubramaniam, M. J. Irwin, and E. Geethanjali, "Power-aware Partitioned Cache Architectures."
5. John L. Hennessy and David A. Patterson, "Computer Architecture: A Quantitative Approach," second edition.
6. Anantha P. Chandrakasan and Robert W. Brodersen, "Minimizing Power Consumption in Digital CMOS Circuits," Proceedings of the IEEE, Vol. 83, No. 4, April 1995.