Towards Practical Page Coloring Based Multi-core Cache Management
Xiao Zhang, Sandhya Dwarkadas, Kai Shen

The Multi-Core Challenge
• Multi-core chip
  – Dominant in the market
  – The last-level cache is commonly shared by sibling cores; however, this sharing is not well controlled
• Challenge: performance isolation
  – Poor performance due to conflicts
  – Unpredictable performance
  – Denial-of-service attacks
[Figure: multi-core chip; source: http://www.intel.com]

Possible Software Approach: Page Coloring
• Partition the cache at coarse granularity
• Page coloring: advocated by many previous works
  – [Bershad'94, Bugnion'96, Cho'06, Tam'07, Lin'08, Soares'08]
• Color # = CacheSize / (PageSize × CacheAssociativity)
[Figure: memory pages of threads A and B mapped to distinct cache colors across ways 1..n]

Challenges for Page Coloring
• Expensive page re-coloring
  – Re-coloring is needed when the optimization goal or the co-runner changes
  – Without extra hardware support, re-coloring means memory copying
  – ~3 microseconds per page copy, >10K pages to copy, possibly at every time quantum
• Artificial memory pressure
  – Restricting an app's cache share also restricts its memory share

Hotness-based Page Coloring
• Basic idea
  – Restrict page coloring to a small group of hot pages
• Challenge
  – How to efficiently identify hot pages
• Outline
  – Efficient hot page identification
  – Cache partition policy
  – Hot page coloring

Method to Track Page Hotness
• Hardware access bits + sequential page table scan
  – Generally available on x86; set automatically by hardware
  – One bit per Page Table Entry (PTE)
• Conventional wisdom: scanning the whole page table is expensive
  – Not entirely true: per-entry scan latency is overlapped by hardware prefetching
  – A sequential scan spends a large portion of its time on non-accessed pages, but we can improve that

Accelerate Sequential Scan
• Programs exhibit spatial locality even at page granularity
• Page non-access correlation metric: Prob(next X neighbors are not accessed | current page is not accessed)
[Plot for SPECcpu2k benchmark mesa: non-access probability vs. number of contiguous non-accessed pages]

Locality-based Jumping
• Start in sequential mode
  – Switch to jumping mode once a non-accessed entry is seen
• If an entry we jumped to is
  – not accessed: double the next jump range
  – accessed: roll back, reset the jump range to 1, and switch back to sequential mode
• Randomized to avoid overlooking pathological access patterns
• A sketch of the scan loop follows this slide
[Figure: example scan over pages 0-13; jumps lengthen over the non-accessed run starting at page 2 and roll back upon reaching page 12, whose access bit is set]
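To make the jumping scan concrete, here is a minimal user-level sketch in C. It assumes the PTE access bits have already been gathered into a flat array; the names (scan_access_bits, MAX_JUMP) and the 16-page demo array are illustrative, not from the authors' kernel implementation, and the randomization of jump distances mentioned on the slide is only noted in a comment.

```c
#include <stdio.h>
#include <stddef.h>

enum { MAX_JUMP = 64 };  /* assumed cap on the jump distance */

/* bits[i] != 0 means page i's PTE access bit is set. */
static void scan_access_bits(const unsigned char *bits, size_t npages)
{
    size_t i = 0;     /* current scan position */
    size_t prev = 0;  /* last entry actually examined before a jump */
    size_t jump = 1;  /* jump == 1 means sequential mode */

    while (i < npages) {
        if (!bits[i]) {
            /* Non-accessed entry: enter (or stay in) jumping mode and
             * double the next jump range.  The slide also randomizes
             * the distance to avoid pathological access patterns;
             * that is omitted here for brevity. */
            prev = i;
            i += jump;
            if (jump < MAX_JUMP)
                jump *= 2;
        } else if (jump > 1) {
            /* Accessed entry reached by a jump: the skipped range
             * (prev, i) may hide accessed pages, so roll back and
             * resume sequentially just past the last entry we
             * really examined. */
            i = prev + 1;
            jump = 1;
        } else {
            /* Sequential mode: record the accessed (hot) page. */
            printf("page %zu is hot\n", i);
            i++;
        }
    }
}

int main(void)
{
    /* Toy trace: pages 0, 1, and 9 were accessed. */
    unsigned char bits[16] = { 1, 1, 0, 0, 0, 0, 0, 0,
                               0, 1, 0, 0, 0, 0, 0, 0 };
    scan_access_bits(bits, sizeof bits);
    return 0;
}
```

Running this prints pages 0, 1, and 9: the doubling jumps skip most of the non-accessed run, and the roll-back at page 9 keeps the scan from overshooting an accessed page. An isolated accessed page inside a long skipped run can still be missed, which is the (small) accuracy loss quantified on the accuracy slide below.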
Sampling of Access Bits
• Recycle spare bits in the PTE as a hotness counter
  – The counter is aged to reflect both recency and frequency
  – Could be extended to support LFU page replacement
• Decouple the sampling frequency from the sampling window
  – Sampling period N, sampling time window T
  – Clear all access bits at times 0, N, 2N, 3N, 4N, ...; check them a window later, at T, N+T, 2N+T, 3N+T, 4N+T, ...

Hot Page Identification Efficiency
• Entries skipped using locality-based jumping: >60% on average
• Runtime overhead
  – Tested 12 SPECcpu2k benchmarks on an Intel 3.0 GHz Core 2 Duo processor
  – On average, 2%/7% overhead at 100/10 millisecond sampling periods
  – Saves 20%/58% over a sequential scan

Hot Page Identification Accuracy
• No major accuracy loss due to jumping, as measured by two metrics (Jeffrey divergence and rank error rate)
• Fairly accurate results

Roadmap
• Efficient hot page identification: locality-based jumping
• Cache partition policy: MRC-based
• Hot page coloring

Cache Partition Policy
• Miss-Ratio-Curve (MRC) based performance model
  – MRCs profiled offline
  – A single app's execution time ≈ Misses × Memory_Latency + Hits × Cache_Latency
• Cache partition goal: optimize overall system performance
  – System performance metric: geometric mean of all apps' normalized performance
  – Normalization baseline: the performance when an app monopolizes the whole cache

MRC-driven Cache Partition Policy
[Figure: miss-ratio curves of threads A and B over cache allocations from 0 to 4MB, with the allocations constrained to sum to the cache size; the optimal partition point maximizes the geometric mean of the two apps' normalized performance]

Hot Page Coloring
• Budget control of page re-coloring overhead
  – A percentage of the time slice, e.g. 5%
• Recolor from the hottest page down until the budget is reached (see the sketch at the end of this deck)
  – Maintain a set of hotness bins during sampling: bin[i][j] = number of pages with color i and normalized hotness in range [j, j+1)
  – Given a budget of K pages, the K-th hottest page's hotness value is estimated in constant time by searching the hotness bins
  – Make sure hot pages are uniformly distributed among colors

Re-coloring Procedure
[Figure: example with a budget of 3 pages when the cache share shrinks and color Gray is revoked; each color's pages are sorted in ascending hotness order (counter values 1, 3, 2, 1 up through 100, 99, 97, 99), and only the hottest pages of the revoked color are re-colored to Red, Blue, and Green]

Performance Comparisons
• {art, equake} vs. {mcf, twolf}
  – 4 SPECcpu2k benchmarks running on 2 sibling cores (Intel Core 2 Duo) that share a 4MB L2 cache

Relieve Artificial Memory Pressure
• An app's footprint may be larger than the pages its entitled memory colors provide
• The app may then "steal" others' colors, a.k.a. "polluting" others' cache share
• Hotness-aware pollution preferentially copies cold pages to others' memory colors (in round-robin fashion so as not to impose new pressure)
[Figure: thread A's footprint (pages A1-A5) spilling into thread B's colors]

Relieve Artificial Memory Pressure
• On a dual-core chip, the L2 cache was originally evenly partitioned between the polluting and victim benchmarks
• Because of memory pressure, the polluting benchmark moves 1/3 (~62MB) of its footprint to the victim's share
[Figure: results for non-space-sensitive apps vs. space-sensitive apps]

Summary
• Contributions
  – Efficient hot page identification that can potentially be used by multiple applications
  – Hotness-based page coloring to mitigate two drawbacks: memory pressure and re-coloring cost
• Caveat: a large time quantum is still required to amortize overhead
• Ongoing work
  – Exploring other possible approaches, e.g., execution-throttling-based cache management [USENIX'09]
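As a closing illustration, here is a minimal sketch of the constant-time budget lookup referenced on the Hot Page Coloring slide. The bin layout follows the slide's bin[i][j] definition; the dimensions (NCOLORS, NBUCKETS) and the name hotness_threshold are assumptions made for this sketch, not the authors' implementation.

```c
#include <stdio.h>
#include <stddef.h>

enum {
    NCOLORS  = 16,   /* assumed number of page colors */
    NBUCKETS = 128   /* assumed normalized hotness range [0, 128) */
};

/* bins[i][j] = number of pages with color i and normalized hotness in
 * [j, j+1), maintained incrementally during access-bit sampling. */
static unsigned bins[NCOLORS][NBUCKETS];

/* Estimate the hotness of the budget_pages-th hottest page by walking
 * the bins from hottest to coldest; pages at or above the returned
 * threshold are the ones worth re-coloring within the budget. */
static int hotness_threshold(size_t budget_pages)
{
    size_t seen = 0;

    for (int j = NBUCKETS - 1; j >= 0; j--) {
        for (int i = 0; i < NCOLORS; i++)
            seen += bins[i][j];
        if (seen >= budget_pages)
            return j;   /* budget exhausted within this bucket */
    }
    return 0;           /* fewer than budget_pages pages in total */
}

int main(void)
{
    /* Toy example: 4 pages of hotness 100 in color 0 and
     * 8 pages of hotness 90 in color 1. */
    bins[0][100] = 4;
    bins[1][90]  = 8;
    printf("threshold for budget 6: %d\n", hotness_threshold(6)); /* 90 */
    return 0;
}
```

The walk touches at most NCOLORS × NBUCKETS counters, a constant independent of how many pages the app maps, which is what makes the slide's constant-time claim work; re-coloring then proceeds over pages at or above the threshold until the time budget (e.g., 5% of the time slice) is spent.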