Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture
Aanjhan Ranganathan (ETH Zurich), Ali Galip Bayrak (EPFL), Theo Kluter (BFH), Philip Brisk (UC Riverside), Edoardo Charbon (TU Delft), Paolo Ienne (EPFL)

Multicore Embedded Systems
• Increasing number of multiprocessor-based embedded systems.
• Low energy requirement with little compromise on performance.
• Significant energy consumption in the memory subsystem (caches, shared bus, main memory).

Symmetric Multiprocessor System
[Diagram: CPU 1, CPU 2, ..., CPU n, each with an instruction cache (I$) and a data cache (D$), connected to shared memory.]

Cache Coherency Problem
[Diagram: the same multiprocessor system.]

Snoopy Hardware Coherence Protocols
[Diagram: the same multiprocessor system.]
• Snoop misses consume excessive energy.

Snoop Filters
[Diagram: the same multiprocessor system with a snoop filter (SF) added per CPU.]
• A snoop filter lookup costs less energy than a cache lookup.

Snoop Filters in Prior Art
• Include, Exclude and Hybrid JETTY
  – Expensive for an embedded system in terms of area.
  – Energy consumed by the JETTYs themselves is significant.
• Stream Registers
  – Present in IBM's BlueGene supercomputer.
  – Inclusive filter.
  – Uses a base and mask register pair to track the cache lines.
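
The base and mask pair mentioned above can be sketched in a few lines of Python. This is only an illustrative model, not the BlueGene hardware or the authors' implementation: the class name `StreamRegister`, the 4-bit address width and the method names are assumptions made for the example. The first inserted address becomes the base; every later insert clears the mask bits in which the new address differs from the base, so the register reports a possible hit for any address that agrees with the base on the remaining mask bits.

```python
class StreamRegister:
    """Illustrative base/mask stream register (hypothetical model, 4-bit addresses by default)."""

    def __init__(self, width=4):
        self.all_ones = (1 << width) - 1  # mask with every address bit set
        self.base = 0
        self.mask = 0
        self.valid = False

    def insert(self, addr):
        """Record that the cache now holds a line with this address."""
        if not self.valid:
            # First line: remember it exactly.
            self.base = addr
            self.mask = self.all_ones
            self.valid = True
        else:
            # Clear every mask bit where the new address differs from the base.
            self.mask &= ~(self.base ^ addr) & self.all_ones

    def may_contain(self, addr):
        """True means the snoop must be forwarded; False means it can be filtered."""
        return self.valid and (addr & self.mask) == (self.base & self.mask)


sr = StreamRegister()
sr.insert(0b1001)
sr.insert(0b1010)
print(format(sr.base, "04b"), format(sr.mask, "04b"))
print(sr.may_contain(0b1011), sr.may_contain(0b0101))
```

Because the mask can only lose bits, the model has no way to forget an address once it has been inserted, which is why the filter stays inclusive but becomes less precise over time.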

Stream Registers
                       Base   Mask   Valid
Initial                ---    ---    0
After adding 0b1001    1001   1111   1
After adding 0b1010    1001   1100   1
• Addresses matching 10XX result in a snoop filter hit.
• No general mechanism to remove an address from an SR without compromising correctness.

Drawbacks of Stream Register based Snoop Filters
• No efficient way to update the registers when a line is removed from the cache
  – Degraded filtering performance over time
  – Additional logic units have been introduced, but they are not efficient (e.g., cache wrap detection)

Our Contribution
• Counting Stream Registers
  – Eliminate the cache wrap detection logic
  – Use a counter to track cache lines
  – More robust to workload variability
  – Better or similar energy savings compared to SRs

Counting Stream Registers
                       Base   Mask   Counter
Initial                ---    ---    0
After adding 0b1001    1001   1111   0x01
After adding 0b1010    1001   1100   0x02
• Removes the need for extra logic such as cache wrap detection, active register history, etc.
• Invalidated cache lines can be tracked by decrementing the counter.

Snoop Filter Architecture
[Figure 2. CSR-filter hit detection logic. Annotations on the address fields: set of cache lines grouped into a page; index into the direct-mapped snoop filter table; bits used for comparison with the base register.]
Figure 2 shows the snoop hit mechanism.

Experimental Analysis
• Virtex 2 FPGA running OpenRISC soft cores
  – Configurable number
of processors, associativity and size of data and instruction caches, cache type and coherence protocol
• EEMBC MultiBench benchmarks
• CACTI 5.3 energy model
  – Total memory subsystem energy accounts for main memory read/write energy, data and instruction cache read/write energy, leakage and snoop energy

Cache Design Space Exploration

Results: Filtering Percentage
[Figure 4. Filtering percentage versus number of stream registers. Panels: RGB-to-CMYK Image Conversion, Image Rotation, Huffman Encoding and Decoding, H.264 Video Encoding. The CSR-filter consistently achieves a higher filtering percentage than the SR-filter.]
• CSR achieves a higher filtering percentage for a smaller number of registers.
• The filtering percentage remains consistent for filters with 32 or more CSRs. These results demonstrate that CSR-filters are more robust to workload variability than SR-filters, while achieving a better overall filtering percentage.

C. Energy Consumption
Our experimental analysis considers the following three system configurations:
• WT: Write-through caches without snoop filters.
• WTSR: Write-through caches with SR-filters.
• WTCSR: Write-through caches with CSR-filters.

Analysis: RGB2CMYK Benchmark
[Figure 5. RGB to CMYK conversion with SR-filters (a) and CSR-filters (b); filtering percentage versus number of stream registers. For SR-filters the filtering rate is inconsistent as the number of SRs increases for a 4MB image; for CSR-filters, the filtering rate is non-decreasing when the number of CSRs increases.]
Paper text visible behind the figure, cropped at the left margin: "Assume a 4MB image size [...] configuration with 8 SRs/CSRs. [...] The CMYK value calculation is a two step process involving several [...] and modifications. The SR-filters do not detect [...] which results in a snoop filter hit, thereby [...] filtering rate as shown in Figure 5(a). [...] With more SRs, the SR-[...] ascertain more accurately whether or not a given [...] in the cache. [...] filters act on the invalidations and modifications [...] than the SR-filters, as shown in Figure 5(b). [...] for a 4MB [...] 16 CSRs appear to be insufficient [...]"

Discussion: Energy Consumption
[Figure: normalized energy versus number of stream registers; the visible panel titles are truncated (Huffman Enc..., Image Ro..., TCP/IP P...).]
• For most benchmarks, snoop energy was around 8-10% of the total memory subsystem energy without snoop filters.
• CSR filters are more effective for certain benchmarks (H.264, image rotation)
  – Better filtering performance with a smaller number of stream registers.
• Small reduction in overall energy
  – Platform limited to 32 MB of off-chip SDRAM
  – No complex data sharing and a limited number
of multiple producers of the same data

Summary
• Introduced a snoop filter architecture based on counting stream registers
  – Lower hardware complexity and the ability to track cache line invalidations
• Experimental evaluation shows a better filtering percentage than stream registers, with less performance variation across workloads.
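
The idea of tracking invalidations with a counter can be illustrated with a small Python model. It is a sketch under stated assumptions rather than the authors' hardware: the class name `CountingStreamRegister`, the 4-bit address width and the rule that the register starts over once its counter returns to zero are choices made for the example. The base and mask are updated as in a plain stream register, the counter is incremented on every insert and decremented on every invalidation, and a zero counter means the register tracks no lines.

```python
class CountingStreamRegister:
    """Illustrative counting stream register (hypothetical model, 4-bit addresses by default)."""

    def __init__(self, width=4):
        self.all_ones = (1 << width) - 1  # mask with every address bit set
        self.base = 0
        self.mask = 0
        self.count = 0  # number of cache lines currently tracked

    def insert(self, addr):
        """Record that the cache now holds a line with this address."""
        if self.count == 0:
            # Empty register: start over with an exact match on this address.
            self.base = addr
            self.mask = self.all_ones
        else:
            # Clear every mask bit where the new address differs from the base.
            self.mask &= ~(self.base ^ addr) & self.all_ones
        self.count += 1

    def remove(self):
        """Record that one tracked line was invalidated or evicted.

        Assumes the caller only reports lines that were inserted earlier.
        """
        if self.count > 0:
            self.count -= 1

    def may_contain(self, addr):
        """True means the snoop must be forwarded; False means it can be filtered."""
        return self.count > 0 and (addr & self.mask) == (self.base & self.mask)


csr = CountingStreamRegister()
csr.insert(0b1001)
csr.insert(0b1010)
hit_before = csr.may_contain(0b1011)   # matches the 10XX pattern

csr.remove()
csr.remove()
hit_after = csr.may_contain(0b1011)    # counter is zero, nothing is tracked

csr.insert(0b0110)                     # the register is precise again
print(hit_before, hit_after, format(csr.mask, "04b"), csr.count)
```

In this model the mask is only restored when the counter reaches zero, so precision is recovered once all tracked lines have left the cache, without separate logic to detect that condition.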