CRUISE: Cache Replacement and Utility-Aware Scheduling
Aamer Jaleel, Hashem H. Najaf-abadi, Samantika Subramaniam, Simon Steely Jr., Joel Emer
Intel Corporation, VSSAD (Aamer.Jaleel@intel.com)
Architectural Support for Programming Languages and Operating Systems (ASPLOS 2012)

Motivation
[Figure: cache hierarchies for a single core (SMT), a dual core (ST/SMT), and a quad core (ST/SMT), each with private L1/L2 caches and a shared LLC]
• A shared last-level cache (LLC) is common as the number of cores increases
• A growing number of concurrent applications increases contention for the shared cache

Problems with LRU-Managed Shared Caches
• The conventional LRU policy allocates resources based on rate of demand
• Applications that have no cache benefit cause destructive cache interference
[Figure: misses per 1000 instructions and cache occupancy for h264ref and soplex under LRU replacement on a 2MB shared cache; soplex occupies most of the cache at h264ref's expense]

Addressing Shared Cache Performance
• The conventional LRU policy allocates resources based on rate of demand; applications that have no cache benefit cause destructive cache interference
• State-of-the-art solutions:
  – Improve cache replacement (HW)
  – Modify memory allocation (SW)
  – Intelligent application scheduling (SW)

HW Techniques for Improving Shared Caches
• Modify the cache replacement policy
• Goal: allocate cache resources based on cache utility, NOT demand
[Figure: two cores sharing an LLC under LRU vs. under intelligent LLC replacement]

SW Techniques for Improving Shared Caches I
• Modify the OS memory allocation policy
• Goal: allocate pages to different cache sets to minimize interference
[Figure: an intelligent memory allocator in the OS directing pages from two cores into different sets of LRU-managed LLCs]

SW Techniques for Improving Shared Caches II
• Modify the scheduling policy using the Operating System (OS) or hypervisor
• Goal: intelligently co-schedule applications to minimize contention
[Figure: cores C0-C3 in pairs over two LRU-managed LLCs (LLC0 and LLC1); the scheduler chooses which applications share each LLC]
SW Techniques for Improving Shared Caches II (continued)
• Example: applications A, B, C, D on cores C0-C3 over two LLCs (LLC0, LLC1)
• Three possible schedules:
  – A, B | C, D
  – A, C | B, D
  – A, D | B, C
[Figure: throughput of the worst vs. optimal schedule on the baseline system (4-core CMP, 3-level hierarchy, LRU-managed LLC): roughly 4.9 vs. 6.3, a ~30% gap; across ~1500 workloads the optimal-to-worst ratio ranges from 1.00 to ~1.35, ~9% on average]

Interactions Between Co-Scheduling and Replacement
• Existing co-scheduling proposals are evaluated on LRU-managed LLCs
• Question: is intelligent co-scheduling necessary with improved cache replacement policies such as DRRIP [Jaleel et al., ISCA'10]?

Interactions Between Optimal Co-Scheduling and Replacement
(4-core CMP, 3-level hierarchy, per-workload comparison of 1365 4-core multi-programmed workloads)
[Figure: scatter plot of optimal/worst schedule under DRRIP vs. under LRU, both axes ranging from 1.00 to 1.28]
• Category I: no need for an intelligent co-schedule under either LRU or DRRIP
• Category II: requires an intelligent co-schedule only under LRU
• Category III: requires an intelligent co-schedule only under DRRIP
• Category IV: requires an intelligent co-schedule under both LRU and DRRIP
• Observation: the need for intelligent co-scheduling is a function of the replacement policy
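The three-schedule example above follows from splitting four applications across two interchangeable LLCs. A minimal sketch (the function name and two-cores-per-LLC default are our own) that enumerates the distinct schedules:

```python
from itertools import combinations

def enumerate_schedules(apps, cores_per_llc=2):
    """Enumerate the distinct ways to split apps across two interchangeable LLCs."""
    apps = set(apps)
    schedules = set()
    for group in combinations(sorted(apps), cores_per_llc):
        rest = tuple(sorted(apps - set(group)))
        # The two LLCs are interchangeable, so A,B | C,D equals C,D | A,B.
        schedules.add(tuple(sorted([group, rest])))
    return sorted(schedules)

for left, right in enumerate_schedules(["A", "B", "C", "D"]):
    print(",".join(left), "|", ",".join(right))
```

Running it lists exactly the three schedules from the slide: A,B | C,D, then A,C | B,D, then A,D | B,C.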
Interactions Between Optimal Co-Scheduling and Replacement (continued)
• Consider Category II workloads, which require an intelligent co-schedule only under LRU:
  – Under LRU-managed LLCs, these workloads must be re-scheduled to avoid a bad co-schedule
  – Under DRRIP-managed LLCs, no re-scheduling is necessary for Category II workloads

Opportunity for Intelligent Application Co-Scheduling
• Prior art: evaluated using inefficient cache policies (i.e., LRU replacement)
• Proposal: Cache Replacement and Utility-aware Scheduling
  – Understand how applications access the LLC (in isolation)
  – Schedule applications based on how they can impact each other
  – (keeping the LLC replacement policy in mind)

Memory Diversity of Applications (In Isolation)
• In isolation, applications fall into four classes of LLC behavior:
  – Core Cache Fitting (CCF), e.g. povray*
  – LLC Friendly (LLCFR), e.g. bzip2*
  – LLC Fitting (LLCF), e.g. sphinx3*
  – LLC Thrashing (LLCT), e.g. bwaves*
*Assuming a 4MB shared LLC
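The four classes above can be told apart from an application's isolated LLC access and miss rates. A hypothetical classifier sketch; the `classify` function, its thresholds, and its parameter names are our own illustration, not the paper's exact rule:

```python
def classify(apki, mpki_full, mpki_half):
    """Classify an app's isolated LLC behavior into one of four classes.

    apki      : LLC accesses per kilo-instruction (isolated)
    mpki_full : isolated misses per kilo-instruction with the full LLC
    mpki_half : isolated misses per kilo-instruction with half the LLC
    Thresholds below are illustrative placeholders.
    """
    if apki < 1.0:
        return "CCF"    # rarely reaches the LLC: core caches suffice
    if mpki_full >= apki * 0.9:
        return "LLCT"   # almost every access misses: thrashing, no LLC benefit
    if mpki_half >= apki * 0.9:
        return "LLCF"   # fits only when given (most of) the whole LLC
    return "LLCFR"      # friendly: benefits from whatever share it gets
```

The half-cache miss rate is what separates LLCF from LLCFR: an LLCF app looks friendly with the whole cache but thrashes once it loses half of it.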
Cache Replacement and Utility-aware Scheduling (CRUISE)
• Core Cache Fitting (CCF) apps:
  – Infrequently access the LLC and do not rely on it for performance
  – Co-scheduling multiple CCF jobs on the same LLC "wastes" that LLC
  – Best to spread CCF applications across the available LLCs
• LLC Thrashing (LLCT) apps:
  – Frequently access the LLC but do not benefit from it at all
  – Under LRU, LLCT apps degrade the performance of other applications, so co-schedule LLCT apps with other LLCT apps
  – Under DRRIP, LLCT apps do not degrade the performance of co-scheduled apps, so it is best to spread LLCT apps across the available LLCs to use cache resources efficiently
• LLC Fitting (LLCF) apps:
  – Frequently access the LLC and require the majority of it
  – Behave like LLCT apps if they do not receive the majority of the LLC
  – Best to co-schedule LLCF apps with CCF applications (if present); if no CCF app is available, schedule with LLCF/LLCT apps
• LLC Friendly (LLCFR) apps:
  – Rely on the LLC for performance but can share the LLC with similar apps
  – Co-scheduling multiple LLCFR jobs on the same LLC will not result in suboptimal performance

CRUISE for LRU-managed Caches (CRUISE-L)
• Applications: LLCT, LLCT, LLCF, CCF
• Co-schedule apps as follows:
  – Co-schedule LLCT apps with LLCT apps
  – Spread CCF applications across LLCs
  – Co-schedule LLCF apps with CCF apps
  – Fill LLCFR apps onto free cores
[Figure: resulting placement on cores C0-C3 over two LLCs, pairing the two LLCT apps on one LLC and the LLCF app with the CCF app on the other]
CRUISE for DRRIP-managed Caches (CRUISE-D)
• Applications: LLCT, LLCT, LLCFR, CCF
• Co-schedule apps as follows:
  – Spread LLCT apps across LLCs
  – Spread CCF apps across LLCs
  – Co-schedule LLCF apps with CCF/LLCT apps
  – Fill LLCFR apps onto free cores
[Figure: resulting placement on cores C0-C3 over two LLCs, pairing LLCFR with LLCT on one LLC and CCF with LLCT on the other]

Experimental Methodology
• System model:
  – 4-wide OoO processor (Core i7 type)
  – 3-level memory hierarchy (Core i7 type)
  – Application scheduler
• Workloads:
  – Multi-programmed combinations of SPEC CPU2006 applications
  – ~1400 4-core multi-programmed workloads (2 cores/LLC)
  – ~6400 8-core multi-programmed workloads (2 cores/LLC, 4 cores/LLC)
[Figure: baseline system with applications A-D on cores C0-C3 over LLC0 and LLC1]

CRUISE Performance on Shared Caches
(4-core CMP, 3-level hierarchy, averaged across all 1365 multi-programmed workload mixes)
[Figure: performance relative to the worst schedule (1.00 to 1.10) for Random, Distributed Intensity (ASPLOS'10), CRUISE-L, CRUISE-D, and Optimal, on LRU-managed and DRRIP-managed LLCs]
• CRUISE provides near-optimal performance
• The optimal co-scheduling decision is a function of the LLC replacement policy

Classifying Application Cache Utility in Isolation
• How do you know an application's classification at run time?
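The CRUISE-L and CRUISE-D rules above can be sketched as a greedy scheduler. This is our own simplification (function name, tie-breaking, and the fallback when no CCF app exists are assumptions, not the paper's implementation):

```python
def cruise_schedule(classes, policy, num_llcs=2, cores_per_llc=2):
    """Assign classified apps to LLCs following the CRUISE-L/CRUISE-D rules.

    classes: list of (app_name, cls) with cls in {"CCF","LLCT","LLCF","LLCFR"}
    policy : "LRU" selects CRUISE-L, anything else selects CRUISE-D
    """
    llcs = [[] for _ in range(num_llcs)]

    def emptiest():
        return min(llcs, key=len)

    def with_class(c):
        # Prefer an LLC that already holds an app of class c and has room.
        for llc in llcs:
            if len(llc) < cores_per_llc and any(k == c for _, k in llc):
                return llc
        return emptiest()  # fallback: our own simplification

    by_class = {c: [a for a in classes if a[1] == c]
                for c in ("LLCT", "CCF", "LLCF", "LLCFR")}

    if policy == "LRU":                       # CRUISE-L
        for app in by_class["LLCT"]:          # pack thrashers together
            with_class("LLCT").append(app)
        for app in by_class["CCF"]:           # spread CCF apps
            emptiest().append(app)
        for app in by_class["LLCF"]:          # pair LLCF with a CCF app
            with_class("CCF").append(app)
    else:                                     # CRUISE-D (DRRIP)
        for app in by_class["LLCT"]:          # spread thrashers
            emptiest().append(app)
        for app in by_class["CCF"]:           # spread CCF apps
            emptiest().append(app)
        for app in by_class["LLCF"]:          # pair with CCF (or LLCT)
            with_class("CCF").append(app)
    for app in by_class["LLCFR"]:             # fill friendly apps anywhere
        emptiest().append(app)
    return llcs

print(cruise_schedule([("a", "LLCT"), ("b", "LLCT"),
                       ("c", "LLCF"), ("d", "CCF")], "LRU"))
```

On the slide-22 mix (LLCT, LLCT, LLCF, CCF under LRU) this reproduces the pairing shown: both LLCT apps on one LLC, LLCF with CCF on the other; under DRRIP the slide-23 mix spreads the LLCT apps apart.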
• Profiling: the application provides its memory intensity at run time (rejected)
• HW performance counters (rejected):
  – Assume isolated cache behavior is the same as shared cache behavior
  – Periodically pause adjacent cores at runtime
• Proposal: Runtime Isolated Cache Estimator (RICE)
  – Architecture support to estimate isolated cache behavior while still sharing the LLC

Runtime Isolated Cache Estimator (RICE)
• Assume a cache shared by two applications, APP0 and APP1
• A few sampling sets per application monitor isolated cache behavior: only APP0 fills to APP0's sampling sets and only APP1 fills to APP1's sampling sets; all other apps bypass those sets
• All remaining sets are follower sets
• Per-application counters compute the isolated hit/miss rate (apki, mpki), yielding per-app classifications <P0, P1, P2, P3>
• Hardware cost: 32 sampling sets per app, 15-bit hit/miss counters
• A second group of sampling sets monitors isolated behavior when only half the cache is available: APP0 fills to only half the ways in those sets, while all other apps use them normally; separate counters track the full-cache (Access-F, Miss-F) and half-cache (Access-H, Miss-H) samples
• The half-cache estimate is needed to classify LLCF applications
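The sampling counters above can be turned into isolated apki/mpki estimates by scaling the 32 sampled sets up to the whole cache. A sketch under assumed parameters (2048 total sets is our assumption; the function name is ours):

```python
def isolated_rates(access_f, miss_f, access_h, miss_h,
                   instructions, num_sets=2048, sample_sets=32):
    """Estimate isolated apki/mpki from RICE-style sampling counters.

    access_f/miss_f: counters over the full-cache sampling sets
    access_h/miss_h: counters over the half-cache sampling sets
    Counts seen by the sampled sets are scaled up to the whole cache,
    then normalized per kilo-instruction.
    """
    scale = (num_sets / sample_sets) * (1000.0 / instructions)
    apki = access_f * scale       # estimated LLC accesses per kilo-instr
    mpki_full = miss_f * scale    # estimated misses with the full LLC
    mpki_half = miss_h * scale    # estimated misses with half the LLC
    return apki, mpki_full, mpki_half
```

The resulting (apki, mpki_full, mpki_half) triple is exactly the input a classifier needs to separate CCF, LLCFR, LLCF, and LLCT behavior at run time.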
Performance of CRUISE using the RICE Classifier
[Figure: performance relative to the worst schedule (0.95 to 1.30) for CRUISE, Distributed Intensity (ASPLOS'10), and Optimal]
• CRUISE using the dynamic RICE classifier is within 1-2% of optimal

Summary
• Optimal application co-scheduling is an important problem
  – Useful for future multi-core processors and virtualization technologies
  – Co-scheduling decisions are a function of the replacement policy
• Our proposal:
  – Cache Replacement and Utility-aware Scheduling (CRUISE)
  – Architecture support for estimating isolated cache behavior (RICE)
• CRUISE is scalable and performs similarly to optimal co-scheduling
• RICE requires negligible hardware overhead

Q&A