Energy-Efficient Hardware Data Prefetching
Yao Guo, Mahmoud Abdullah Bennaser and Csaba Andras Moritz

CONTENTS
- Introduction
- Hardware prefetching
- Hardware data prefetching methods
- Performance speedup
- Energy-aware prefetching techniques
- PARE
- Conclusion

Introduction
- Data prefetching is the process of fetching data the program will need in advance, before the instruction that requires it is executed.
- It hides the apparent memory latency.
- Two types:
  - Software prefetching, using the compiler
  - Hardware prefetching, using additional circuitry

Hardware Prefetching
- Uses additional circuitry.
- Prefetch tables store recent load instructions and the relations between load instructions.
- Gives better performance, but the extra circuitry and table lookups come at an energy cost.

Hardware Data Prefetching Methods
- Sequential prefetching
- Stride prefetching
- Pointer prefetching
- Combined stride and pointer prefetching

Sequential Prefetching
- One-block-lookahead (OBL) approach: initiate a prefetch for block b+1 when block b is accessed.
- Prefetch-on-miss: prefetch block b+1 whenever an access to block b results in a cache miss.
- Tagged prefetching: associates a tag bit with every memory block; when a block is demand-fetched, or a prefetched block is referenced for the first time, the next block is prefetched.

[Figure: OBL approaches, comparing prefetch-on-miss and tagged prefetch; shows demand-fetched and prefetched blocks and their tag bits (0/1).]

Stride Prefetching
- Employs special logic to monitor the processor's address referencing pattern.
- Detects constant-stride array references originating from looping structures.
- Compares successive addresses used by load or store instructions.

Reference Prediction Table (RPT)
- 64 entries of 64 bits each.
- Holds the most recently used memory instructions.
- Each entry records:
  - the address of the memory instruction
  - the previous address accessed by the instruction
  - the stride
  - a state field

[Figure: organization of the RPT, indexed by the load PC and compared against the effective address.]
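The RPT's stride detection can be illustrated with a minimal simulation sketch. This is hypothetical code, not the authors' implementation: the confirm-then-prefetch policy below is a simplification of the RPT state machine, and the eviction rule is a stand-in.

```python
class RPT:
    """Toy Reference Prediction Table: pc -> [prev_addr, stride, confirmed]."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}

    def access(self, pc, addr):
        """Record one load; return a prefetch address once the stride is stable."""
        e = self.table.get(pc)
        if e is None:
            if len(self.table) >= self.entries:
                self.table.pop(next(iter(self.table)))  # evict the oldest entry
            self.table[pc] = [addr, 0, False]
            return None
        prev, stride, confirmed = e
        new_stride = addr - prev
        if confirmed and new_stride == stride:
            e[0] = addr
            return addr + stride          # steady state: issue a prefetch
        # training: remember the latest stride, require one confirmation
        self.table[pc] = [addr, new_stride, new_stride == stride]
        return None
```

For a load walking an array with stride 8, the table needs a few accesses to train before it starts issuing prefetches one stride ahead.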
Each RPT entry holds an instruction tag, the previous address, the stride, and a state field; the prefetch address is computed as previous address + stride.

Pointer Prefetching
- Effective for pointer-intensive programs, where there is no constant stride.
- Dependence-based prefetching detects the dependence relationships between load instructions.
- Uses two hardware tables:
  - Correlation table (CT), storing the dependence information.
  - Potential producer window (PPW), recording recently loaded values and the corresponding load instructions.

Combined Stride and Pointer Prefetching
- Objective: evaluate a technique that works for all types of memory access patterns.
- Uses both array (stride) and pointer prefetching, for better performance on all benchmarks.
- Requires all three tables (RPT, PPW, CT).

Performance Speedup
- The combined (stride+dep) technique has the best speedup for most benchmarks.

[Figure: speedup (0.8 to 2.4) of sequential, tagged, stride, dependence, and stride+dep prefetching over no-prefetch, on mcf, parser, art, bzip2, galgel, bh, em3d, health, mst, perim, and their average.]

Energy-Aware Prefetching Architecture

[Figure: energy-aware prefetching architecture. Compiler hints encoded in the load instruction (LDQ RA, RB, OFFSET) drive selective filtering of regular cache accesses and adaptive selection between the stride and pointer prefetchers; prefetches are then filtered by a stride counter and a prefetch filtering buffer (PFB) before probing the L1 D-cache tag and data arrays and prefetching from the L2 cache.]

Energy-Aware Prefetching Techniques
- Compiler-Based Selective Filtering (CBSF): only selected memory instructions search the prefetch hardware tables.
- Compiler-Assisted Adaptive Prefetching (CAAP): selects different prefetching schemes per access.
- Compiler-driven Filtering using a Stride Counter (SC): reduces the prefetching energy wasted on small strides.
- Hardware-based Filtering using a Prefetch Filtering Buffer (PFB): drops redundant prefetches.

Compiler-Based Selective Filtering
- Only memory instructions identified by the compiler search the prefetch hardware tables.
- Energy is reduced by:
  - considering only loop or recursive memory accesses
  - using only array and linked-data-structure memory accesses

Compiler-Assisted Adaptive Prefetching
- Selects the prefetching scheme based on the type of access:
  - Memory accesses to an array that does not belong to any larger structure are fed only into the stride prefetcher.
  - Memory accesses to an array that belongs to a larger structure are fed into both the stride and pointer prefetchers.

Compiler-Hinted Filtering Using a Runtime Stride Counter
- Reduces the prefetching energy wasted on memory access patterns with very small strides.
- Small-stride prefetches are filtered out: a prefetch is worthwhile only when the stride is larger than half the cache line size, since smaller strides fall within a line that has already been fetched.
- Each counter entry holds the program counter (PC) of the load and its stride counter.

PARE: A Power-Aware Prefetch Engine
- Used for reducing power dissipation.
- Two ways to reduce power:
  - Reduce the size of each entry, based on the spatial locality of memory accesses.
  - Partition the large table into multiple smaller tables.

PARE Hardware Prefetch Table
- Breaks the whole prefetch table up into 16 smaller tables, each containing 4 entries.
- Each entry also contains a group number.
- Uses only the lower 16 bits of the PC instead of all 32.

[Figure: PARE table design.]

Advantages of the PARE Hardware Table
- Power consumption is reduced: each lookup activates fewer CAM cells, and the small tables lower the total power consumption.

Conclusion
- Improves performance while reducing the energy overhead of hardware data prefetching.
- Total energy consumption is reduced by combining compiler-assisted and hardware-based energy-aware techniques with a new power-aware prefetch engine (PARE).

References
- Y. Guo et al., "Energy-Efficient Hardware Data Prefetching," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 19, no. 2, Feb. 2011.
- A. J. Smith, "Sequential program prefetching in memory hierarchies," IEEE Computer, vol. 11, no. 12, pp. 7–21, Dec. 1978.
- A. Roth, A. Moshovos, and G. S. Sohi, "Dependence based prefetching for linked data structures," in Proc. ASPLOS-VIII, Oct. 1998, pp. 115–126.
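Backup: the hardware PFB filtering step can be sketched as a small FIFO of recently prefetched cache-line tags. This is a hypothetical simulation; the 8-entry depth and 32-byte line size are illustrative assumptions, not the paper's parameters.

```python
from collections import OrderedDict

class PFB:
    """Toy Prefetch Filtering Buffer: a FIFO of recently prefetched line tags."""

    def __init__(self, size=8, line_bytes=32):
        self.size = size
        self.line = line_bytes
        self.buf = OrderedDict()  # line tag -> None, kept in FIFO order

    def should_prefetch(self, addr):
        tag = addr // self.line
        if tag in self.buf:           # recently prefetched: filter as redundant
            return False
        self.buf[tag] = None
        if len(self.buf) > self.size:
            self.buf.popitem(last=False)  # evict the oldest tag
        return True
```

Filtering a prefetch that hits in the PFB saves the L1 tag-array probe that the redundant prefetch would otherwise pay for.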
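Backup: a sketch of PARE's partitioned lookup, using the parameters from the slides (16 tables of 4 entries, lower 16 PC bits). The group-selection rule below (low PC bits) is a stand-in assumption; in PARE the group number is assigned with compiler help based on access locality.

```python
NUM_TABLES, ENTRIES_PER_TABLE = 16, 4

def pare_index(pc):
    pc16 = pc & 0xFFFF            # PARE keeps only the lower 16 bits of the PC
    group = pc16 % NUM_TABLES     # picks the one small table to activate
    tag = pc16 // NUM_TABLES      # remaining bits are matched inside that table
    return group, tag

def comparisons(partitioned):
    # CAM entries activated per lookup: one 4-entry table vs. the whole table
    return ENTRIES_PER_TABLE if partitioned else NUM_TABLES * ENTRIES_PER_TABLE
```

Activating 4 CAM entries instead of 64 per lookup is where the power saving comes from: only the selected small table switches.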