Using a Victim Buffer in an Application-Specific Memory Hierarchy

Chuanjun Zhang*, Frank Vahid**
*Dept. of Electrical Engineering
Dept. of Computer Science and Engineering
University of California, Riverside
**Also with the Center for Embedded Computer Systems at UC Irvine

This work was supported by the National Science Foundation and the Semiconductor Research Corporation

Chuanjun Zhang, UC Riverside


Low Power/Energy Techniques are Essential

- High performance processors are going to be too hot to work
- Low energy dissipation is imperative for battery-driven embedded systems
- Low power techniques are essential to both embedded systems and high performance processors

[Figure: Skadron et al., 30th ISCA. "Hot enough to cook an egg."]

Frank Vahid, UC Riverside


Caches Consume Much Power

- Caches consume 50% of total processor system power
  - ARM920T and M*CORE (Segars 01, Lee 99)
- Caches accessed often
  - Consume dynamic power
- Associativity reduces misses
  - Less power off-chip, but more power per access
- Victim buffer helps (Jouppi 90)
  - Add to direct-mapped cache
  - Keep recently evicted lines in small buffer, check on miss
  - Like higher associativity, but without extra power per access
  - 10% energy savings, 4% performance improvement (Albera 99)

[Diagram: Processor, Cache, Victim buffer, Memory]


Victim Buffer

With a victim buffer:
- One cycle on a cache hit
- Two cycles on a victim buffer hit
- Twenty-two cycles on a victim buffer miss

Without a victim buffer:
- One cycle on a cache hit
- Twenty-one cycles on a cache miss
- More accesses to off-chip memory

[Diagram: PROCESSOR, L1 cache, victim buffer, and OFFCHIP MEMORY, with the hit and miss paths annotated with the cycle counts listed above]


Cache Architecture with a Configurable Victim Buffer

- Is a victim buffer a useful configurable cache parameter?
  - Helps for some applications
  - For others, not useful: the VB misses, so the extra cycle is wasteful
  - Thus, want ability to shut off VB for given app.
- Hardware overhead
  - One-bit register
  - A switch
- Four-line victim buffer shown

[Diagram: L1 cache (SRAM) and a fully-associative victim buffer with CAM tags (27-bit tag) and SRAM data (16-byte cache line data). A one-bit VB on/off register drives a switch in the Vdd control circuit. Signals shown: data to processor through a mux, victim line from the cache, data from next-level memory, control signals to and from the cache control circuit, and control signals to the next-level memory.]


Hit Rate of a Victim Buffer

[Figure: two bar charts, one for the data cache and one for the instruction cache. Y axis: hit rate, 0% to 100%. X axis: benchmarks padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjpeg, ucbqsort, v42, adpcm, epic, g721, pegwit, mpeg, jpeg, art, mcf, parser, vpr, and the average. Series: 8 Kbyte, 4 Kbyte, and 2 Kbyte caches.]

Hit rate of the victim buffer when added to an 8 Kbyte, 4 Kbyte, or 2 Kbyte direct-mapped cache. Benchmarks from Powerstone, MediaBench, and Spec 2000.


Computing Total Memory-Related Energy

- Consider CPU stall energy and off-chip memory energy
- Excludes CPU active energy
- Thus, represents all memory-related energy

  energy_mem = energy_dynamic + energy_static
  energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
  energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
  energy_static = cycles * energy_static_per_cycle

  energy_miss = k_miss_energy * energy_hit
  energy_static_per_cycle = k_static * energy_total_per_cycle
  (we varied the k's to account for different system implementations)

Measured quantities:
- SimpleScalar: cache_hits, cache_misses, cycles
- Our layout or data sheets: the others


Performance and Energy Benefits of Victim Buffer with a Direct-Mapped Cache

[Figure: bar chart of performance and energy improvement per benchmark. Y axis: -4% to 12%. Bar labels: 60%, 13%, 38%, 43%, 24%, 21%, 15%. X axis: padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjpeg, ucbqsort, v42, adpcm, epic, g721, pegwit, mpeg, jpeg, art, mcf, parser, vpr. Annotations: "Substantial benefit" and "Should shut off VB".]

An 8-line victim buffer with an 8 Kbyte direct-mapped cache (0% = direct-mapped cache without victim buffer).

- Configurable victim buffer is clearly useful to avoid performance penalty for certain applications


Is a Configurable Victim Buffer Useful Even With a Configurable Cache?

- We showed that a configurable cache can reduce memory access power by half on average (Zhang/Vahid/Najjar ISCA 03, ISVLSI 03)
- Software-configurable cache
  - Associativity: 1, 2 or 4 ways
  - Size: 2, 4 or 8 Kbytes
  - Line size
- Does that configurability subsume the usefulness of a configurable victim buffer?

[Figure: normalized energy (0.0 to 1.0) versus associativity (1, 2, 4) for epic and mpeg2; the cache diagram labels a 2 Kb way.]


Best Configurable Cache with VB Configurations

Optimal cache configuration when cache associativity, cache size, and victim buffer are all configurable. I and D stand for instruction cache and data cache, respectively. V means the victim buffer is on. nK means the cache size is n Kbytes. The associativity is represented by the last four characters: for benchmark vpr, I2D1 stands for a two-way instruction cache and a direct-mapped data cache.

Note that sometimes the victim buffer should be on, sometimes off.

  Example    Best            Example    Best
  padpcm     I8KD4KI1D2      ucbqsort   I4KDV4KI1D1
  crc        I2KDV4KI1D1     v42        I8KD8KI1D1
  auto2      I4KD2KI1D1      adpcm      I2KDV2KI1D1
  bcnt       I2KD2KI1D1      epic       IV4KDV8KI1D1
  bilv       I4KD2KI1D1      jpeg       I8KD2KI4D1
  binary     I4KD2KI1D1      mpeg2      I4KDV4KI1D1
  blit       I2KDV2KI1D1     g721       I8KDV2KI2D1
  brev       I4KD2KI1D1      art        I4KDV2KI1D1
  g3fax      I4KDV2KI1D1     mcf        I4KD4KI1D1
  fir        I4KD2KI1D1      parser     I8KDV4KI4D1
  pjpeg      I4KDV2KI1D1     vpr        I8KD2KI2D1
  pegwit     I4KD4KI1D1


Performance and Energy Benefits of Victim Buffer Added to a Configurable Cache

[Figure: bar chart of performance and energy improvement per benchmark. Y axis: -4% to 12%. Bar labels: 32%, 23%, 43%. X axis: padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjpeg, ucbqsort, v42, adpcm, epic, g721, pegwit, mpeg, jpeg, art, mcf, parser, vpr.]

An 8-line victim buffer with a configurable cache whose associativity, size, and line size are configurable (0% = optimal configuration without VB).

- Still surprisingly effective


Conclusion

- Configurable victim buffer useful with direct-mapped cache
  - As much as 60% energy and 4% performance improvements for some applications
  - Can shut off to avoid performance penalty on other apps.
- Configurable victim buffer also useful with configurable cache
  - As much as 43% energy and 8% performance improvement for some applications
  - Can shut off to avoid performance overhead on other applications
- Configurable victim buffer should be included as a software-configurable parameter to direct-mapped as well as configurable caches for embedded system architectures
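

Sketch: victim buffer timing and memory-related energy

The cycle counts on the "Victim Buffer" slide and the formulas on the "Computing Total Memory-Related Energy" slide can be tied together in a short simulation. The following is a minimal sketch, not the authors' simulator (the slides name SimpleScalar for that). The function names, the FIFO replacement of the victim buffer, the swap on a victim buffer hit, and every default value of energy_hit, k_miss_energy, k_static, and energy_total_per_cycle are assumptions made for illustration; only the 16-byte line size, the 1/2/21/22-cycle latencies, and the shape of the equations come from the slides.

```python
from collections import OrderedDict

LINE_BYTES = 16      # 16-byte cache line, as on the architecture slide
HIT_CYCLES = 1       # cache hit
VB_HIT_CYCLES = 2    # victim buffer hit
MISS_CYCLES = 21     # miss without a victim buffer; one more cycle with it on


def simulate(addresses, cache_bytes=8192, vb_lines=8, vb_on=True):
    """Run a byte-address trace through a direct-mapped cache.

    Returns (cache_hits, vb_hits, misses, cycles).
    Assumption: the victim buffer is fully associative with FIFO replacement.
    """
    num_sets = cache_bytes // LINE_BYTES
    cache = {}            # set index -> line number currently resident
    vb = OrderedDict()    # line number -> None, oldest entry first
    cache_hits = vb_hits = misses = cycles = 0

    for addr in addresses:
        line = addr // LINE_BYTES
        index = line % num_sets
        resident = cache.get(index)

        if resident == line:
            cache_hits += 1
            cycles += HIT_CYCLES
            continue

        if vb_on and line in vb:
            # Victim buffer hit: the line moves back into the cache.
            del vb[line]
            vb_hits += 1
            cycles += VB_HIT_CYCLES
        else:
            # Fetch from off-chip memory; checking the buffer first costs a cycle.
            misses += 1
            cycles += MISS_CYCLES + (1 if vb_on else 0)

        if vb_on and resident is not None:
            # The evicted line becomes the newest victim.
            vb[resident] = None
            if len(vb) > vb_lines:
                vb.popitem(last=False)   # drop the oldest victim
        cache[index] = line

    return cache_hits, vb_hits, misses, cycles


def memory_energy(cache_hits, cache_misses, cycles, energy_hit=1.0,
                  k_miss_energy=50.0, k_static=0.3, energy_total_per_cycle=1.0):
    """energy_mem from the slide equations; all default values are placeholders."""
    energy_miss = k_miss_energy * energy_hit
    energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
    energy_static_per_cycle = k_static * energy_total_per_cycle
    energy_static = cycles * energy_static_per_cycle
    return energy_dynamic + energy_static
```

For a trace that alternates ten times between two addresses 8 Kbytes apart, both addresses map to the same set of an 8 Kbyte direct-mapped cache. With the buffer on, the sketch reports 2 misses and 18 victim buffer hits in 80 cycles; with it off, 20 misses in 420 cycles. The slides do not say how victim buffer hits enter the energy equations, so any choice there (for example, charging them as cache hits) is a further assumption.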
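

Sketch: decoding the configuration labels

The labels in the "Best Configurable Cache with VB Configurations" table follow the naming rule given on that slide (I and D for the two caches, V when the victim buffer is on, nK for the size, and the last four characters for associativity). A small decoder makes the rule concrete. This is a hedged sketch: the CacheConfig type, its field names, and parse_config are hypothetical names introduced here, and the pattern assumes every label has exactly the shape seen in the table.

```python
import re
from typing import NamedTuple


class CacheConfig(NamedTuple):
    """Decoded label; the field names are illustrative, not from the slides."""
    icache_kbytes: int
    icache_vb_on: bool
    icache_ways: int
    dcache_kbytes: int
    dcache_vb_on: bool
    dcache_ways: int


# I[V]<n>K D[V]<n>K I<ways> D<ways>, for example "IV4KDV8KI1D1"
_LABEL = re.compile(r"I(V?)(\d+)KD(V?)(\d+)KI(\d+)D(\d+)")


def parse_config(label):
    """Decode a configuration label such as 'I8KDV4KI4D1'."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized configuration label: {label!r}")
    i_vb, i_size, d_vb, d_size, i_ways, d_ways = match.groups()
    return CacheConfig(
        icache_kbytes=int(i_size),
        icache_vb_on=(i_vb == "V"),
        icache_ways=int(i_ways),
        dcache_kbytes=int(d_size),
        dcache_vb_on=(d_vb == "V"),
        dcache_ways=int(d_ways),
    )
```

For example, parse_config("I8KD2KI2D1"), the vpr entry, decodes to an 8 Kbyte two-way instruction cache and a 2 Kbyte direct-mapped data cache, both with the victim buffer off, which matches the reading given on the slide.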