ECE8833 Polymorphous and Many-Core Computer Architecture
Lecture 6: Fair Caching Mechanisms for CMP
Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering

Cache Sharing in CMP [Kim, Chandra, Solihin, PACT'04]
(Slides courtesy of Seongbeom Kim, D. Chandra, and Y. Solihin @ NCSU)
• Two processor cores, each with a private L1 cache, share a single L2 cache.
• Running alone, thread t1 fills the L2 with its own working set; running alone, t2 does the same.
• When t1 and t2 run together, they contend for the shared L2, and t2's throughput is significantly reduced due to unfair cache sharing.

Shared L2 Cache Space Contention
[Figure: two bar charts of gzip's normalized cache misses per instruction and gzip's normalized IPC, comparing gzip(alone) against gzip+applu, gzip+apsi, gzip+art, and gzip+swim.]

Impact of Unfair Cache Sharing
• Uniprocessor scheduling: threads t1, t2, t3 each get one time slice in turn (t1, t2, t3, t1, ...).
• 2-core CMP scheduling: P1 runs t1 in every time slice, while P2 rotates among t2, t3, and t4.
• gzip will get more time slices than the others if it is set to run at a higher priority, yet it could still run slower than they do (priority inversion).
• It could further slow down the other processes (starvation).
• Thus the overall throughput is reduced.

Stack Distance Profiling Algorithm [Qureshi+, MICRO-39]
• Keep one hit counter per LRU-stack position, from MRU to LRU, plus a miss counter.
• Example for a 4-way cache:

    Counter           Value
    CTR Pos 0 (MRU)   30
    CTR Pos 1         20
    CTR Pos 2         15
    CTR Pos 3 (LRU)   10
    Misses            25

Stack Distance Profiling
• There is a counter C_i for each cache way; C_{>A} is the counter for misses (accesses whose stack distance exceeds the associativity A).
• The counters show the reuse frequency of each way in the cache.
• They can be used to predict the misses for any associativity smaller than A. For gzip on an 8-way cache, misses for a 2-way cache = C_{>A} + Σ C_i for i = 3 to 8.
• art likely has poor temporal locality, so it does not need all the space it occupies.
• If the space given to art were halved and handed to gzip, what would happen?
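Because LRU is a stack algorithm, these counters are enough to answer such questions offline. Below is a minimal sketch of the prediction in C, assuming per-position hit counters like the 4-way example above (names are illustrative, not from the paper):

```c
#include <stdio.h>

/* Predict misses for a cache with only w ways, given per-recency-
 * position hit counters from a cache with 'assoc' ways (index 0 =
 * MRU) and the observed miss count. Since LRU is a stack algorithm,
 * every hit at stack position >= w would become a miss in the
 * smaller cache. */
unsigned predicted_misses(const unsigned hits[], unsigned assoc,
                          unsigned w, unsigned misses)
{
    unsigned m = misses, i;
    for (i = w; i < assoc; i++)
        m += hits[i];            /* hits beyond way w turn into misses */
    return m;
}

int main(void)
{
    unsigned hits[4] = { 30, 20, 15, 10 };   /* the 4-way example */
    /* 2-way prediction: 25 + 15 + 10 = 50 misses */
    printf("2-way misses: %u\n", predicted_misses(hits, 4, 2, 25));
    return 0;
}
```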
Fairness Metrics [Kim et al. PACT'04]
• Goal: uniform slowdown, i.e., every thread should be slowed by the same factor when sharing the cache:

    T_shared_i / T_alone_i = T_shared_j / T_alone_j

  where T_alone_i is the execution time of thread i when it runs alone and T_shared_i is its execution time when it shares the cache with others.
• Ideally, we want to minimize

    M0_ij = |X_i − X_j|, where X_i = T_shared_i / T_alone_i

• Execution times are hard to measure online, so miss-based proxy metrics try to equalize the ratio of miss increase of each thread:

    M1_ij = |X_i − X_j|, where X_i = Miss_shared_i / Miss_alone_i
    M3_ij = |X_i − X_j|, where X_i = MissRate_shared_i / MissRate_alone_i
    M5_ij = |X_i − X_j|, where X_i = MissRate_shared_i − MissRate_alone_i

Partitionable Cache Hardware
• Modified LRU cache replacement policy [G. E. Suh et al., HPCA 2002], with a per-thread counter tracking each thread's current cache allocation.
• On a miss, the current partition is compared against the target partition. Example: the current partition is P1: 448B, P2: 576B and the target partition is P1: 384B, P2: 640B. On a P2 miss, P2 is below its target, so the victim is P1's LRU line; afterwards the current partition (P1: 384B, P2: 640B) matches the target.
• The partition granularity could be as coarse as one entire cache way.

Dynamic Fair Caching Algorithm (example: optimizing the M3 metric)
• Three groups of counters per thread:
  – MissRate alone: miss rate when running the process alone (obtained from stack distance profiling)
  – MissRate shared: dynamic miss rate while running with the shared cache
  – Target partition: the target partition size
• A repartitioning interval of 10K accesses was found to be the best; the partition granularity here is 64KB.

Walkthrough (MissRate alone: P1: 20%, P2: 5%; initial target partition: 256KB each):
• 1st interval: measured shared miss rates are P1: 20%, P2: 15%. Repartition: evaluating M3 gives X1 = 20%/20% = 1 and X2 = 15%/5% = 3, so P2 is hurt far more; the target becomes P1: 192KB, P2: 320KB.
• 2nd interval: shared miss rates are P1: 20%, P2: 10%. Repartition: X1 = 1, X2 = 2, still unequal; the target becomes P1: 128KB, P2: 384KB.
• 3rd interval: shared miss rates are P1: 25%, P2: 9%. The last repartitioning barely helped P2: Δ = MR_old − MR_new = 10% − 9% = 1%. Since Δ < T_rollback, the algorithm rolls the target back to P1: 192KB, P2: 320KB. A T_rollback threshold of 20% was found to be the best.
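The interval logic above fits in a short C sketch, assuming two threads, the 64KB granularity, and one reading of the rollback test (Δ compared directly against T_rollback); all names are illustrative:

```c
#include <stddef.h>

#define GRAIN      (64u * 1024u)  /* 64KB repartitioning granularity */
#define T_ROLLBACK 0.20           /* rollback threshold from the slides */

typedef struct {
    double mr_alone[2];     /* from stack-distance profiling */
    double mr_shared[2];    /* measured over the last 10K accesses */
    double mr_prev[2];      /* measured over the interval before that */
    size_t target[2];       /* current target partition sizes (bytes) */
    size_t prev_target[2];  /* targets before the last repartitioning */
} fair_state;

void repartition_m3(fair_state *s)
{
    int i, grow;
    double x0, x1;

    /* Rollback: if the thread that gained space last time improved by
     * less than T_ROLLBACK (delta = MR_old - MR_new), undo the change. */
    for (i = 0; i < 2; i++) {
        if (s->target[i] > s->prev_target[i] &&
            s->mr_prev[i] - s->mr_shared[i] < T_ROLLBACK) {
            s->target[0] = s->prev_target[0];
            s->target[1] = s->prev_target[1];
            return;
        }
    }

    /* Otherwise move one grain toward the thread with the larger
     * X_i = MissRate_shared / MissRate_alone (the M3 metric). */
    x0 = s->mr_shared[0] / s->mr_alone[0];
    x1 = s->mr_shared[1] / s->mr_alone[1];
    grow = (x0 > x1) ? 0 : 1;
    if (s->target[1 - grow] >= GRAIN) {
        s->prev_target[0] = s->target[0];
        s->prev_target[1] = s->target[1];
        s->target[grow]     += GRAIN;
        s->target[1 - grow] -= GRAIN;
    }
}
```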
Generic Repartitioning Algorithm
• Pick the thread with the largest X_i and the thread with the smallest X_i as a pair for repartitioning, shifting capacity from the less-slowed thread to the more-slowed one.
• Repeat for all candidate processes.

Utility-Based Cache Partitioning (UCP) [Qureshi & Patt, MICRO-39]
(Slides courtesy of Moin Qureshi)
• Running two processes on a dual core. [Figure: miss curves for equake and vpr as a function of the number of ways given (1 to 16).]
• LRU: in real runs, on average 7 ways were allocated to equake and 9 to vpr.
• UTIL: how much you use (in a set) is how much you will get; ideally, equake gets 3 ways and vpr gets 13.

Defining Utility
• Utility U_a^b = Misses with a ways − Misses with b ways.
• [Figure: misses per 1000 instructions vs. the number of ways from a 16-way 1MB L2, showing low-utility, high-utility, and saturating-utility applications.]

Framework for UCP
• [Figure: each core has private I$ and D$ plus a utility monitor (UMON1, UMON2); both cores share the L2 cache, which is backed by main memory; the partitioning algorithm (PA) sits between the UMONs and the L2.]
• Three components: a utility monitor (UMON) per core, the partitioning algorithm (PA), and replacement support to enforce the partitions.

Utility Monitors (UMON)
• For each core, simulate the LRU policy using an Auxiliary Tag Directory (ATD); UMON-global keeps one set of way counters shared by all sets.
• Hit counters in the ATD count hits per recency position, H0 (MRU) through H15 (LRU).
• LRU is a stack algorithm, so the hit counts directly give utility: e.g., hits(2 ways) = H0 + H1.
• The extra tags incur hardware and power overhead; Dynamic Set Sampling (DSS) reduces the overhead [Qureshi et al. ISCA'06].
• 32 sampled sets are sufficient based on Chebyshev's inequality; the paper samples every 32nd set (simple static sampling).
• Storage is < 2KB per UMON (about 0.17% of the L2).

Partitioning Algorithm (PA)
• Evaluate all possible partitions and select the best. With a ways given to core1 and (16 − a) ways to core2:
  – Hits_core1 = H0 + H1 + ... + H(a−1), from UMON1
  – Hits_core2 = H0 + H1 + ... + H(16−a−1), from UMON2
• Select the a that maximizes (Hits_core1 + Hits_core2).
• Partitioning is done once every 5 million cycles; after each partitioning interval, the hit counters in all UMONs are halved to retain some past information.
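A minimal sketch of the PA's exhaustive search for the dual-core case, assuming the sixteen per-position hit counters from each UMON (illustrative code, not the paper's hardware):

```c
#define WAYS 16

/* Try every split of a 16-way cache between two cores (each core
 * gets at least one way) and return the number of ways for core 1
 * that maximizes total hits. hits1/hits2 are the per-recency-
 * position counters from UMON1 and UMON2 (index 0 = MRU). */
int pick_partition(const unsigned hits1[WAYS], const unsigned hits2[WAYS])
{
    int a, i, best_a = 1;
    unsigned best_score = 0;

    for (a = 1; a < WAYS; a++) {
        unsigned score = 0;
        for (i = 0; i < a; i++)            /* Hits_core1 = H0..H(a-1) */
            score += hits1[i];
        for (i = 0; i < WAYS - a; i++)     /* Hits_core2 = H0..H(16-a-1) */
            score += hits2[i];
        if (score > best_score) {
            best_score = score;
            best_a = a;
        }
    }
    return best_a;          /* core 2 receives WAYS - best_a ways */
}
```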
Replacement Policy to Reach the Desired Partition
• Use way partitioning [Suh+ HPCA'02, Iyer ICS'04]: each line contains core-id bits.
• On a miss, count ways_occupied in the set by the miss-causing app; this yields a binary decision for dual-core (in this paper):
  – If ways_occupied < ways_given, the victim is the LRU line from the other app.
  – Otherwise, the victim is the LRU line from the miss-causing app.

UCP Performance
• UCP improves the average weighted speedup by 11% (dual core) and the average throughput by 17%.

Dynamic Insertion Policy
(Slides courtesy of Yuejian Xie)
• Conventional LRU inserts the incoming block at the MRU position.
• A block that is never reused then occupies a cache block for a long time with no benefit while it drifts from MRU down to LRU.

LIP: LRU Insertion Policy [Qureshi et al. ISCA'07]
• Insert the incoming block at the LRU position instead.
• A useless block is evicted at the next eviction; a useful block is moved to the MRU position on a hit.
• LIP is not entirely new: Intel tried this in 1998 when designing "Timna" (which integrated the CPU and a graphics accelerator sharing the L2).

BIP: Bimodal Insertion Policy [Qureshi et al. ISCA'07]
• LIP may not age older lines, so BIP infrequently inserts lines at the MRU position.
• Let e be the bimodal throttle parameter:

    if ( rand() < e )
        Insert at MRU position;   // i.e., the conventional LRU policy
    else
        Insert at LRU position;

• Promote to MRU if reused.

DIP: Dynamic Insertion Policy [Qureshi et al. ISCA'07]
• There are two types of workloads: LRU-friendly or BIP-friendly.
• DIP can be implemented by: (1) monitoring both policies (LRU and BIP), (2) choosing the best-performing one, and (3) applying that policy to the cache.
• DIP thus selects between LRU and BIP, where BIP itself inserts at LRU (as LIP does) with probability 1 − e and at MRU with probability e.
• A cost-effective implementation is needed: "Set Dueling".

Set Dueling for DIP [Qureshi et al. ISCA'07]
(Slide courtesy of Moin Qureshi)
• Divide the cache in three: dedicated LRU sets, dedicated BIP sets, and follower sets (which follow the winner of LRU vs. BIP).
• A single n-bit saturating counter monitors both policies: misses to the LRU sets increment it, misses to the BIP sets decrement it.
• The counter's MSB decides the policy for the follower sets: MSB = 0, use LRU; MSB = 1, use BIP.
• Monitor, choose, and apply, all with a single counter.
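The selection logic fits in a few lines of C. A sketch, assuming a 10-bit counter (the width n is an assumption, not from the slide):

```c
#include <stdbool.h>

#define N_BITS  10                       /* assumed counter width */
#define CTR_MAX ((1u << N_BITS) - 1)

static unsigned psel;                    /* n-bit saturating counter */

void miss_in_dedicated_lru_set(void) { if (psel < CTR_MAX) psel++; }
void miss_in_dedicated_bip_set(void) { if (psel > 0)       psel--; }

/* Policy for the follower sets: MSB = 0 -> LRU, MSB = 1 -> BIP
 * (a high counter means the LRU sets have been missing more). */
bool followers_use_bip(void)
{
    return (psel >> (N_BITS - 1)) & 1u;
}
```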
Promotion/Insertion Pseudo Partitioning (PIPP) [Xie & Loh ISCA'09]
(Slides courtesy of Yuejian Xie)
• What's PIPP? Promotion/Insertion Pseudo Partitioning: it achieves both capacity management (as in UCP) and dead-time management (as in DIP).
• Eviction: the LRU block is the victim.
• Insertion: the new block is placed the core's quota worth of positions away from LRU (e.g., with a target allocation of 3, the insert position is the third slot from LRU).
• Promotion: on a hit, the block moves toward MRU by only one position.

PIPP Example (Core0 quota: 5 blocks, Core1 quota: 3 blocks)
• Core1 misses on block D: the LRU block is evicted and D is inserted three positions from LRU (Core1's quota = 3).
• Core0 misses on block 6 and then on block 7: each time the LRU block is evicted and the new block is inserted five positions from LRU (Core0's quota = 5).
• Core1 then hits on D: D is promoted by exactly one position toward MRU.

How PIPP Does Both Managements
• Capacity management: per-core quotas (e.g., Core0: 6, Core1: 4, Core2: 4, Core3: 2 ways) set each core's insertion position.
• Dead-time management: inserting closer to the LRU position lets blocks that are never reused leave the cache quickly.

Pseudo Partitioning Benefits
• With a strict partition, each core replaces only within its own region, so a new block from Core1 can never displace one of Core0's blocks.
• With PIPP's pseudo partition, all blocks share a single LRU stack, so Core1 can effectively "steal" a line from Core0.

Promote By One (PIPP) vs. Directly to MRU (TADIP)
• Consider a block with a single reuse. Under TADIP, its one hit promotes it directly to MRU, after which the dead block occupies the cache all the way down the stack. Under PIPP, the hit promotes it by only one position, so it stays near LRU and is evicted soon after its single reuse.

Algorithm Comparison

    Algorithm    | Capacity Management  | Dead-time Management                  | Note
    LRU          | –                    | –                                     | Baseline, no explicit management
    UCP          | Strict partitioning  | –                                     |
    DIP / TADIP  | –                    | Insert at LRU, promote to MRU on hit  |
    PIPP         | Pseudo-partitioning  | Incremental promotion                 |
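To make PIPP's insertion and promotion rules concrete, here is a small C sketch of one set, modeled as an array ordered from MRU to LRU (an illustrative model, not the paper's implementation):

```c
#include <string.h>

#define ASSOC 8

typedef struct { int tag; int core; } line_t;

/* On a miss by 'core': evict the LRU block (the last array slot) and
 * insert the new block quota[core] positions away from LRU. */
void pipp_insert(line_t set[ASSOC], const int quota[], int core, int tag)
{
    int pos = ASSOC - quota[core];       /* quota-th slot from LRU */
    /* Shift the blocks below the insert position one step toward
     * LRU; the old LRU block falls off the end (evicted). */
    memmove(&set[pos + 1], &set[pos], (ASSOC - 1 - pos) * sizeof(line_t));
    set[pos].tag  = tag;
    set[pos].core = core;
}

/* On a hit at index 'pos': promote the block toward MRU by one. */
void pipp_promote(line_t set[ASSOC], int pos)
{
    if (pos > 0) {
        line_t tmp   = set[pos - 1];
        set[pos - 1] = set[pos];
        set[pos]     = tmp;
    }
}
```

With quotas {5, 3} on an 8-entry set, a Core1 miss inserts at index 5 (the third slot from LRU) and a Core0 miss at index 3 (the fifth slot from LRU), matching the walkthrough above.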