The Migration Prefetcher
HiPEAC 2012, Paris (France) – January 23, 2012

Javier Lira (Intel-UPC, Spain) – javierx.lira@intel.com
Timothy M. Jones (U. of Cambridge, UK) – timothy.jones@cl.cam.ac.uk
Carlos Molina (URV, Spain) – carlos.molina@urv.net
Antonio González (Intel-UPC, Spain) – antonio.gonzalez@intel.com

Motivation
• CMPs have become the dominant paradigm.
• They incorporate large shared last-level caches:
  ◦ Intel® Nehalem – 24 MBytes
  ◦ IBM® POWER7 – 32 MBytes
  ◦ Tilera® Tile-GX – 32 MBytes
• Access latency in large caches is dominated by wire delays.

NUCA
• NUCA divides a large cache into smaller, faster banks.
• Cache access latency consists of the routing latency plus the bank access latency.
• Banks close to the cache controller have smaller latencies than banks farther away.
[1] Kim et al. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Architectures. ASPLOS'02

S-NUCA vs. D-NUCA
• In D-NUCA, data can be mapped to multiple banks.
• Migration allows data to adapt to the application's behaviour.
• Migration movements are effective, but about 50% of hits still happen in non-optimal banks.

Outline
• Introduction
• Methodology
• The Migration Prefetcher
• Analysis of results
• Conclusions

Baseline D-NUCA design
• Placement: 16 possible positions per data block.
• Access: partitioned multicast.
• Migration: gradual promotion.
• Replacement: LRU + zero-copy.
[Diagram: eight cores (Core 0 – Core 7) surrounding the banked NUCA cache.]
[2] Beckmann and Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. MICRO'04

Methodology
• Workloads: SPEC CPU2006 and PARSEC, running on Solaris 10.
• Simulated machine: 8 × UltraSPARC IIIi, modelled with Simics and GEMS (Ruby, Garnet, Orion).

  Number of cores:       8 – UltraSPARC IIIi
  Frequency:             1.5 GHz
  Main memory size:      4 GBytes
  Memory bandwidth:      512 Bytes/cycle
  Private L1 caches:     8 × 32 KBytes, 2-way
  Shared L2 NUCA cache:  8 MBytes, 128 banks
  NUCA bank:             64 KBytes, 8-way
  L1 cache latency:      3 cycles
  NUCA bank latency:     4 cycles
  Router delay:          1 cycle
  On-chip wire delay:    1 cycle
  Main memory latency:   250 cycles (from core)

The Migration Prefetcher
• Applies prefetching principles to data migration.
• This is not a traditional prefetcher:
  ◦ It does not bring data from main memory.
  ◦ Its potential benefits are therefore more limited.
  ◦ It requires only simple data correlation.
[Diagram: when a core accesses address A, the Next Address Table (NAT) predicts the next address B; the prefetcher then moves block B from its current location (Bank 5) closer to the requesting core.]

Confidence bits
• 1 confidence bit is effective.
• More than 1 bit is not worthwhile.
[Figure: fraction of prefetch requests that ended up being useful.]

NAT size
• With 12–14 addressable bits, about 25% of prefetches use erroneous (aliased) information.
• A NAT with 12 addressable bits takes 232 KBytes in total.
[Figure: percentage of prefetch requests submitted with another address's information.]

Search scheme
• Predicting the data location based on its last appearance provides 50% accuracy.
• Accuracy increases when the local bank is also accessed.
[Figure: percentage of prefetch requests that are found in the NUCA cache.]

Realistic configuration
The realistic Migration Prefetcher uses:
  ◦ 1-bit confidence for data patterns.
  ◦ A NAT with 12 addressable bits (29 KBytes per table).
  ◦ Last responder + Local as the search scheme.
• Total hardware overhead: 264 KBytes.
• Prefetcher latency: 2 cycles.

Analysis of results
• The NUCA cache is up to 25% faster with the Migration Prefetcher.
• It reduces NUCA cache latency by 15%, on average.
• It achieves overall performance improvements of 4% on average, and up to 17%.

Network traffic and energy
• The prefetcher introduces extra traffic into the network.
• On a prefetch hit, however, it reduces the number of messages significantly.
• Overall, this technique does not increase energy consumption.

Conclusions
• Existing migration techniques effectively concentrate the most accessed data in banks that are close to the cores.
• Even so, about 50% of hits in NUCA are in non-optimal banks.
• The Migration Prefetcher anticipates migrations based on past access patterns.
• It reduces the average NUCA latency by 15%.
• It outperforms the baseline configuration by 4%, on average, and does not increase energy consumption.

Questions?
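The NAT-based prediction described in the slides (a per-address next-address entry guarded by a single confidence bit) can be sketched in software as follows. This is a behavioural sketch only, not the authors' hardware design: the 1-bit hysteresis policy (a first misprediction clears the bit, a second replaces the entry), the indexing by the low 12 bits of the address, and all names are illustrative assumptions.

```python
NAT_INDEX_BITS = 12  # "12 addressable bits", as in the realistic configuration


class MigrationPrefetcher:
    """Behavioural sketch of a Next Address Table with 1-bit confidence."""

    def __init__(self):
        # Each NAT entry maps an index to (predicted next address, confidence bit).
        self.nat = {}
        self.last_addr = None

    def _index(self, addr):
        # Only the low 12 bits index the NAT, so distinct addresses can alias
        # (a plausible source of the ~25% erroneous information on the slides).
        return addr & ((1 << NAT_INDEX_BITS) - 1)

    def access(self, addr):
        """Record an access; return an address to prefetch, or None."""
        # 1. Learn: correlate the previous access with the current one.
        if self.last_addr is not None:
            idx = self._index(self.last_addr)
            entry = self.nat.get(idx)
            if entry is None:
                self.nat[idx] = (addr, 0)       # new pattern, low confidence
            elif entry[0] == addr:
                self.nat[idx] = (addr, 1)       # pattern confirmed
            elif entry[1] == 1:
                self.nat[idx] = (entry[0], 0)   # mispredicted once: lose confidence
            else:
                self.nat[idx] = (addr, 0)       # mispredicted again: replace entry
        self.last_addr = addr

        # 2. Predict: trigger a migration prefetch only for confident entries.
        entry = self.nat.get(self._index(addr))
        if entry is not None and entry[1] == 1:
            return entry[0]
        return None
```

For example, after the access stream A, B, A, B the A→B correlation is confirmed, so the next access to A returns B as the block to prefetch toward the requesting core. A real implementation would act on this prediction by moving the block in the NUCA fabric rather than returning it to software.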