Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Lavanya Subramanian 1 Shared Resource Interference Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Shared Cache Main Memory 2 6 6 5 5 Slowdown Slowdown High and Unpredictable Application Slowdowns 4 3 2 1 0 4 3 2 1 0 leslie3d (core 0) gcc (core 1) leslie3d (core 0) mcf (core 1) application slowdowns depends due to 2.1. AnHigh application’s performance shared resource interference on which application it is running with 3 Outline Goals: 1. High performance 2. Predictable performance • Blacklisting memory scheduler • Predictability with memory interference 4 Outline Goals: 1. High performance 2. Predictable performance • Blacklisting memory scheduler • Predictability with memory interference 5 Background: Main Memory Row-buffer miss hit Memory Controller Rows Columns Bank 0 Bank 1 Bank 2 Bank 3 Row Row Buffer Buffer Row Buffer Row Buffer Row Buffer Channel • FR-FCFS Memory Scheduler [Zuravleff and Robinson, US Patent ‘97; Rixner et al., ISCA ‘00] – Row-buffer hit first – Older request first • Unaware of inter-application interference 6 Tackling Inter-Application Interference: Memory Request Scheduling • Monitor application memory access characteristics • Rank applications based on memory access characteristics • Prioritize requests at the memory controller, based on ranking 7 An Example: Thread Cluster Memory Scheduling Memory-non-intensive thread Throughput thread thread thread thread thread Non-intensive cluster higher priority higher priority Prioritized thread Threads in the system Memory-intensive Intensive cluster Fairness Figure: Kim et al., MICRO 2010 8 Problems with Previous Application-aware Memory Schedulers • Hardware Complexity – Ranking incurs high hardware cost • Unfair slowdowns of some applications – Ranking causes unfairness 9 High Hardware Complexity • Ranking incurs high hardware cost – Rank computation incurs logic/storage cost – Rank enforcement requires comparison logic 80000 8 6 4 2 FRFCFS 8x TCM Area (in square um) Latency (in ns) 10 60000 1.8x 40000 FRFCFS TCM 20000 Avoid ranking to achieve low hardware cost 0 0 Synthesized with a 32nm standard cell library 10 astar soplex • Lower-rank applications experience 40 80 30 significant slowdowns 60 50 Number of Requests Number of Requests Ranking Causes Unfair Application Slowdowns 100 40 TCM – Low memory service causes slowdown 20 Grouping – Periodic rank shuffling not0sufficient 20 10 0 0 50 100 0 12 10 10 8 8 6 TCM Grouping Grouping 100 Execution Time (in 1000s of Cycles) Slowdown Slowdown Execution Time (in 1000s of Cycles) 50 TCM 6 TCM 4 Grouping Grouping offers lower unfairness than ranking 2 4 2 0 0 11 Problems with Previous Application-Aware Memory Schedulers • Hardware Complexity – Ranking incurs high hardware cost • Unfair slowdowns of some applications – Ranking causes unfairness Our Goal: Design a memory scheduler with Low Complexity, High Performance, and Fairness 12 Towards a New Scheduler Design • Monitor applications that have a number of consecutive requests served 1. Simple Grouping Mechanism • Blacklist such applications • Prioritize requests of non-blacklisted applications 2. Enforcing Priorities Based On Grouping • Periodically clear blacklists 13 Methodology • Configuration of our simulated system – – – – 24 cores 4 channels, 8 banks/channel DDR3 1066 DRAM 512 KB private cache/core • Workloads – SPEC CPU2006, TPCC, Matlab – 80 multi programmed workloads 14 Metrics • System Performance: N Harmonic Speedup IPCialone IPCshared i i Slowdown IPCishared Weighted Speedup alone i IPCi 10 6 4 2 0 0 0.5 1 Speedup • Fairness: Maximum Slowdown max 8 alone i shared i IPC IPC 15 Previous Memory Schedulers • FR-FCFS [Zuravleff and Robinson, US Patent 1997, Rixner et al., ISCA 2000] – Prioritizes row-buffer hits and older requests – Application-unaware • PARBS [Mutlu and Moscibroda, ISCA 2008] – Batches oldest requests from each application; prioritizes batch – Employs ranking within a batch • ATLAS [Kim et al., HPCA 2010] – Prioritizes applications with low memory-intensity • TCM [Kim et al., MICRO 2010] – Always prioritizes low memory-intensity applications – Shuffles request priorities of high memory-intensity applications 16 Performance Results 8 15 FRFCFS FRFCFS-Cap 6 PARBS 4 ATLAS TCM 2 0 Blacklisting Maximum Slowdown Weighted Speedup 10 FRFCFS 10 FRFCFS-Cap PARBS ATLAS 5 TCM Blacklisting 0 Approaches fairness of PARBS and FRFCFS-Cap achieving better 5% higher system performance and 25% performance than TCM lower maximum slowdown than TCM 17 Complexity Results Latency (in ns) 10 8 6 120000 FRFCFS FRFCFS-Cap PARBS ATLAS 4 TCM 2 Blacklisting 0 Area (in square um) 12 100000 80000 60000 FRFCFS FRFCFS-Cap PARBS ATLAS 40000 TCM 20000 Blacklisting 0 Blacklisting achieves 43%lower lowerlatency area than 70% thanTCM TCM 18 Outline Goals: 1. High performance 2. Predictable performance • Blacklisting memory scheduler • Predictability with memory interference 19 Need for Predictable Performance • There is a need for predictable performance – When multiple applications share resources – Especially if some applications require performance guarantees • Example 1: In step: server systems As a first Predictable performance – Different users’ jobs consolidated onto the same server in thetopresence of memory – Need provide bounded slowdowns tointerference critical jobs • Example 2: In mobile systems – Interactive applications run with non-interactive applications – Need to guarantee performance for interactive applications 20 Outline Goals: 1. High performance 2. Predictable performance • Blacklisting memory scheduler • Predictability with memory interference 21 Predictability in the Presence of Memory Interference 1. Estimate Slowdown – Key Observations – MISE Operation: Putting it All Together – Evaluating the Model 2. Control Slowdown – Providing Soft Slowdown Guarantees – Minimizing Maximum Slowdown 22 Predictability in the Presence of Memory Interference 1. Estimate Slowdown – Key Observations – MISE Operation: Putting it All Together – Evaluating the Model 2. Control Slowdown – Providing Soft Slowdown Guarantees – Minimizing Maximum Slowdown 23 Slowdown: Definition Performance Alone Slowdown Performance Shared 24 Key Observation 1 Normalized Performance For a memory bound application, Performance Memory request service rate 1 omnetpp 0.9 Difficult mcf 0.8 Requestastar Service Performanc e AloneRate Alone Intel Core i7, 4 cores Slowdown0.6 Mem. Bandwidth: 8.5 GB/s Performanc e SharedRate Request Service Shared 0.5 0.7 0.4 Easy 0.3 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Normalized Request Service Rate 25 Key Observation 2 Request Service Rate Alone (RSRAlone) of an application can be estimated by giving the application highest priority in accessing memory Highest priority Little interference (almost as if the application were run alone) 26 Key Observation 2 1. Run alone Time units Request Buffer State Service order 3 2 Main Memory 1 Main Memory 2. Run with another application Time units Request Buffer State Main Memory Service order 3 2 1 Main Memory 3. Run with another application: highest priority Time units Request Buffer State Main Memory 3 Service order 2 1 Main Memory 27 Memory Interference-induced Slowdown Estimation (MISE) model for memory bound applications Request Service Rate Alone (RSRAlone) Slowdown Request Service Rate Shared (RSRShared) 28 Key Observation 3 • Memory-bound application Compute Phase Memory Phase No interference With interference Req Req Req time Req Req Req time Memory phase slowdown dominates overall slowdown 29 Key Observation 3 • Non-memory-bound application Compute Phase Memory Phase Memory Interference-induced Slowdown Estimation 1 (MISE) model for non-memory bound applications No interference RSRAlone time Slowdown (1 - ) RSRShared With interference 1 RSRAlone RSRShared time Only memory fraction () slows down with interference 30 Predictability in the Presence of Memory Interference 1. Estimate Slowdown – Key Observations – MISE Operation: Putting it All Together – Evaluating the Model 2. Control Slowdown – Providing Soft Slowdown Guarantees – Minimizing Maximum Slowdown 31 MISE Operation: Putting it All Together Interval Interval time Measure RSRShared, Estimate RSRAlone Measure RSRShared, Estimate RSRAlone Estimate slowdown Estimate slowdown 32 Predictability in the Presence of Memory Interference 1. Estimate Slowdown – Key Observations – MISE Operation: Putting it All Together – Evaluating the Model 2. Control Slowdown – Providing Soft Slowdown Guarantees – Minimizing Maximum Slowdown 33 Previous Work on Slowdown Estimation • Previous work on slowdown estimation – STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO ‘07] – FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS ‘10] • Basic Idea: Difficult Stall Time Alone Slowdown Stall Time Shared Easy Count number of cycles application receives interference 34 Two Major Advantages of MISE Over STFM • Advantage 1: – STFM estimates alone performance while an application is receiving interference Difficult – MISE estimates alone performance while giving an application the highest priority Easier • Advantage 2: – STFM does not take into account compute phase for non-memory-bound applications – MISE accounts for compute phase Better accuracy 35 Methodology • Configuration of our simulated system – – – – 4 cores 1 channel, 8 banks/channel DDR3 1066 DRAM 512 KB private cache/core • Workloads – SPEC CPU2006 – 300 multi programmed workloads 36 Quantitative Comparison SPEC CPU 2006 application leslie3d Slowdown 4 3.5 3 2.5 Actual STFM MISE 2 1.5 1 0 20 40 60 80 100 Million Cycles 37 4 4 3 3 3 2 1 2 1 4 Average of MISE: 0 50 0error 50 100 8.2% 100 cactusADM GemsFDTD soplex Average error of STFM: 29.4% 4 4 (across 300 workloads) 3 3 Slowdown 3 2 1 0 2 1 0 0 1 50 Slowdown 0 2 0 0 0 Slowdown Slowdown 4 Slowdown Slowdown Comparison to STFM 50 wrf 100 100 2 1 0 0 50 calculix 100 0 50 100 povray 38 Predictability in the Presence of Memory Interference 1. Estimate Slowdown – Key Observations – MISE Operation: Putting it All Together – Evaluating the Model 2. Control Slowdown – Providing Soft Slowdown Guarantees – Minimizing Maximum Slowdown 39 MISE-QoS: Providing “Soft” Slowdown Guarantees • Goal 1. Ensure QoS-critical applications meet a prescribed slowdown bound 2. Maximize system performance for other applications • Basic Idea – Allocate just enough bandwidth to QoS-critical application – Assign remaining bandwidth to other applications 40 Outline Goals: 1. High performance 2. Predictable performance • Blacklisting memory scheduler • Predictability with memory interference 41 A Recap • Problem: Shared resource interference causes high and unpredictable application slowdowns • Approach: – Simple mechanisms to mitigate interference – Slowdown estimation models – Slowdown control mechanisms • Future Work: – Extending to shared caches 42 Shared Cache Interference Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Shared Cache Main Memory 43 Impact of Cache Capacity Contention Shared Main Memory Shared Main Memory and Caches 2 Slowdown Slowdown 2 1.5 1 0.5 0 1.5 1 0.5 0 bzip2 (core 0) soplex (core 1) bzip2 (core 0) soplex (core 1) Cache capacity interference causes high application slowdowns 44 Backup Slides 45 Outline Goals: 1. High performance 2. Predictable performance • Blacklisting memory scheduler • Predictability with memory interference • Coordinated cache/memory management for performance • Cache slowdown estimation • Coordinated cache/memory management for predictability 46 Outline Goals: 1. High performance 2. Predictable performance • Blacklisting memory scheduler • Predictability with memory interference • Coordinated cache/memory management for performance • Cache slowdown estimation • Coordinated cache/memory management for predictability 47 Request Service vs. Memory Access Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Memory Access Rate Shared Cache Main Memory Request Service Rate Request service and access rates tightly coupled 48 Estimating Cache and Memory Slowdowns Through Cache Access Rates Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Cache Access Rate Shared Cache Main Memory Cache Access Rate Alone Slowdown Cache Access Rate Shared 49 1.5 1.4 1.3 1.2 1.1 1 Slowdown Slowdown Cache Access Rate vs. Slowdown 1 1.2 1.4 Cache Access Rate Ratio bzip2 2.2 2 1.8 1.6 1.4 1.2 1 1 1.5 2 2.5 Cache Access Rate Ratio xalancbmk 50 Challenge How to estimate alone cache access rate? Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Cache Access Rate Shared Cache Auxiliary Tags Main Memory Priority 51 Outline Goals: 1. High performance 2. Predictable performance • Blacklisting memory scheduler • Predictability with memory interference • Coordinated cache/memory management for performance • Cache slowdown estimation • Coordinated cache/memory management for predictability 52 Leveraging Slowdown Estimates for Performance Optimization • How do we leverage slowdown estimates to achieve high performance by allocating – Memory bandwidth? – Cache capacity? • Leverage other metrics along with slowdowns – Memory intensity – Cache miss rates 53 Coordinated Resource Allocation Schemes Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Core Cache capacity-aware bandwidth allocation Shared Cache Main Memory Bandwidth-aware cache capacity allocation 54 Outline Goals: 1. High performance 2. Predictable performance • Blacklisting memory scheduler • Predictability with memory interference • Coordinated cache/memory management for performance • Cache slowdown estimation • Coordinated cache/memory management for predictability 55 Coordinated Resource Management Schemes for Predictable Performance Goal: Cache capacity and memory bandwidth allocation for an application to meet a bound Challenges: • Large search space of potential cache capacity and memory bandwidth allocations • Multiple possible combinations of cache/memory allocations for each application 56 Outline Goals: 1. High performance 2. Predictable performance • Blacklisting memory scheduler • Predictability with memory interference • Coordinated cache/memory management for performance • Cache slowdown estimation • Coordinated cache/memory management for predictability 57 Timeline Apr. May Jun. Jul. 2014 2015 Aug. Sep. Oct. Nov. Dec. Jan. Feb. Mar. Apr. May Cache slowdown estimation (75% Goal) Coordinated cache/memory management for performance (100% Goal) Coordinated cache/memory management for predictability (125% Goal) Writeup thesis and defend 58 Summary • Problem: Shared resource interference causes high and unpredictable application slowdowns • Goals: High and predictable performance • Our Approach: – Simple mechanisms to mitigate interference – Slowdown estimation models – Coordinated cache/memory management 59