18-447 Computer Architecture
Lecture 31: Predictable Performance

Lavanya Subramanian
Carnegie Mellon University
Spring 2015, 4/15/2015

Shared Resource Interference

[Figure: multiple cores contending for a shared cache and main memory]

High and Unpredictable Application Slowdowns

[Figure: two bar charts of slowdowns, for leslie3d (core 0) running with gcc (core 1) and for leslie3d (core 0) running with mcf (core 1)]

1. High application slowdowns due to shared resource interference
2. An application's performance depends on which application it is running with

Need for Predictable Performance

• There is a need for predictable performance
– When multiple applications share resources
– Especially if some applications require performance guarantees
• Example 1: In virtualized systems
– Different users' jobs consolidated onto the same server
– Need to meet performance requirements of critical jobs in the presence of shared resources
• Example 2: In mobile systems
– Interactive applications run with non-interactive applications
– Need to guarantee performance for interactive applications

Our Goal: Predictable performance

Tackling Different Parts of the Shared Memory Hierarchy

[Figure: cores, shared cache, and main memory; this talk addresses interference at main memory first, then at the shared cache]

Predictability in the Presence of Memory Bandwidth Interference (HPCA 2013)

1. Estimate Slowdown
– Key Observations
– Implementation
– MISE Model: Putting it All Together
– Evaluating the Model
2. Control Slowdown
– Providing Soft Slowdown Guarantees

Slowdown: Definition

    Slowdown = Performance Alone / Performance Shared

Key Observation 1

For a memory-bound application, Performance ∝ Memory request service rate

    Slowdown = Performance Alone / Performance Shared
             ≈ Request Service Rate Alone / Request Service Rate Shared

Performance alone is difficult to measure; request service rates are easy to measure.

[Figure: normalized performance vs. normalized request service rate for omnetpp, mcf, and astar on an Intel Core i7 (4 cores); the two quantities track each other closely]

Key Observation 2

The Request Service Rate Alone (RSRAlone) of an application can be estimated by giving the application highest priority at the memory controller.

Highest priority → Little interference (almost as if the application were run alone)

[Figure: request buffer state and service order for three scenarios: (1) the application run alone, (2) run with another application, and (3) run with another application but at highest priority. With highest priority, the application's requests are serviced in nearly as few time units as when run alone.]

Memory Interference-induced Slowdown Estimation (MISE) model for memory-bound applications:

    Slowdown = Request Service Rate Alone (RSRAlone) / Request Service Rate Shared (RSRShared)

Key Observation 3

• Memory-bound application: execution alternates between compute phases and memory phases; memory phase slowdown dominates overall slowdown.

[Figure: request timelines with and without interference for a memory-bound application]

• Non-memory-bound application: a fraction (1 − α) of time is spent in the compute phase and a fraction α in the memory phase; only the memory fraction (α) slows down with interference.

MISE model for non-memory-bound applications:

    Slowdown = (1 − α) + α × (RSRAlone / RSRShared)
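Before moving to the implementation, the combined model is simple enough to state as code. The sketch below is a minimal illustration evaluating the MISE formula with made-up request service rates; the function and values are illustrative, not part of the lecture's hardware mechanism.

```python
def mise_slowdown(rsr_alone, rsr_shared, alpha):
    """MISE slowdown estimate.

    alpha is the memory phase fraction. For a fully memory-bound
    application (alpha = 1) this reduces to rsr_alone / rsr_shared;
    for alpha < 1, only the memory fraction of execution slows down
    with interference.
    """
    return (1 - alpha) + alpha * (rsr_alone / rsr_shared)

# Illustrative values only (requests serviced per cycle):
print(mise_slowdown(rsr_alone=0.8, rsr_shared=0.4, alpha=1.0))  # 2.0
print(mise_slowdown(rsr_alone=0.8, rsr_shared=0.4, alpha=0.5))  # 1.5
```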
Interval-Based Operation

Execution time is divided into intervals. During each interval, measure RSRShared and α and estimate RSRAlone; at the end of each interval, estimate slowdown, then repeat.

Measuring RSRShared and α

• Request Service Rate Shared (RSRShared)
– A per-core counter tracks the number of requests serviced
– At the end of each interval:

    RSRShared = Number of Requests Served / Interval Length

• Memory Phase Fraction (α)
– Count the number of stall cycles at the core
– Compute the fraction of cycles stalled for memory

Estimating Request Service Rate Alone (RSRAlone)

Goal: Estimate RSRAlone. How: Periodically give each application highest priority in accessing memory.

• Divide each interval into shorter epochs
• At the beginning of each epoch, randomly pick one application as the highest-priority application
• At the end of an interval, for each application, estimate:

    RSRAlone = Number of Requests During High Priority Epochs / Number of Cycles Application Given High Priority

Inaccuracy in Estimating RSRAlone

• Even when an application has highest priority, it still experiences some interference.

[Figure: even at highest priority, the application's request may wait in the request buffer while a previously issued request from another application is serviced; such cycles are interference cycles]

Accounting for Interference in RSRAlone Estimation

• Solution: determine and remove interference cycles from the RSRAlone calculation (see the counter-level sketch after this list):

    RSRAlone = Number of Requests During High Priority Epochs / (Number of Cycles Application Given High Priority − Interference Cycles)

• A cycle is an interference cycle if
– a request from the highest-priority application is waiting in the request buffer, and
– another application's request was issued previously
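The following is a counter-level sketch of one interval's bookkeeping, combining the measurements above into a slowdown estimate. The counter names and values are illustrative assumptions; a real implementation would keep these in per-core hardware counters at the memory controller.

```python
def estimate_slowdown(interval):
    # Request Service Rate Shared: requests served over the interval.
    rsr_shared = interval["requests_served"] / interval["interval_cycles"]

    # Memory phase fraction: fraction of cycles the core stalled on memory.
    alpha = interval["memory_stall_cycles"] / interval["interval_cycles"]

    # Request Service Rate Alone: measured only during the epochs in
    # which this application had highest priority, after removing the
    # cycles in which it still suffered interference.
    effective_hp_cycles = (interval["high_priority_cycles"]
                           - interval["interference_cycles"])
    rsr_alone = interval["high_priority_requests"] / effective_hp_cycles

    return (1 - alpha) + alpha * (rsr_alone / rsr_shared)

# Hypothetical counters for one 5M-cycle interval:
counters = {
    "interval_cycles": 5_000_000,
    "requests_served": 40_000,
    "memory_stall_cycles": 4_000_000,   # alpha = 0.8
    "high_priority_cycles": 500_000,
    "interference_cycles": 50_000,
    "high_priority_requests": 9_000,
}
print(round(estimate_slowdown(counters), 2))  # 2.2
```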
MISE Operation: Putting it All Together

The model operates in repeating intervals, as described above: during each interval, measure RSRShared and α and estimate RSRAlone; at each interval boundary, estimate slowdown and start over.

Previous Work on Slowdown Estimation

• STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO '07]
• FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS '10]
• Per-thread Cycle Accounting [Du Bois et al., HiPEAC '13]
• Basic idea:

    Slowdown = Stall Time Alone / Stall Time Shared

Stall time shared is easy to measure; stall time alone is difficult, so these schemes count the number of cycles an application receives interference.

Two Major Advantages of MISE Over STFM

• Advantage 1:
– STFM estimates alone performance while an application is receiving interference → difficult
– MISE estimates alone performance while giving an application the highest priority → easier
• Advantage 2:
– STFM does not take into account the compute phase for non-memory-bound applications
– MISE accounts for the compute phase → better accuracy

Methodology

• Configuration of our simulated system
– 4 cores
– 1 channel, 8 banks/channel
– DDR3-1066 DRAM
– 512 KB private cache/core
• Workloads
– SPEC CPU2006
– 300 multiprogrammed workloads

Quantitative Comparison

[Figure: actual, STFM-estimated, and MISE-estimated slowdown of the SPEC CPU2006 application leslie3d over 100 million cycles; MISE tracks the actual slowdown closely]

Comparison to STFM

[Figure: actual vs. estimated slowdowns for cactusADM, GemsFDTD, soplex, wrf, calculix, and povray]

• Average error of MISE: 8.2%
• Average error of STFM: 29.4%
(across 300 workloads)

Possible Use Cases

• Bounding application slowdowns [HPCA '14]
• VM migration and admission control schemes [VEE '15]
• Fair billing schemes in a commodity cloud

MISE-QoS: Providing "Soft" Slowdown Guarantees

• Goals
1. Ensure QoS-critical applications meet a prescribed slowdown bound
2. Maximize system performance for other applications
• Basic Idea
– Allocate just enough bandwidth to the QoS-critical application (a feedback sketch follows the methodology below)
– Assign the remaining bandwidth to the other applications

Methodology

• Each of 25 applications in total is considered in turn as the QoS-critical application
• Run with 12 sets of co-runners of different memory intensities
• Total of 300 multiprogrammed workloads
• Each workload run with 10 slowdown bound values
• Baseline memory scheduling mechanism
– Always prioritize the QoS-critical application [Iyer et al., SIGMETRICS 2007]
– Other applications' requests scheduled in FR-FCFS order [Zuravleff and Robinson, US Patent 1997; Rixner et al., ISCA 2000]
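The following is a feedback-style sketch of the basic idea: each interval, compare the QoS-critical application's estimated slowdown against the bound and adjust its bandwidth allocation just enough. The step size and the 0..1 "share" abstraction are illustrative assumptions, not the paper's exact mechanism.

```python
STEP = 0.05  # fraction of bandwidth moved per interval (assumed)

def adjust_share(share, estimated_slowdown, bound):
    """Return the QoS-critical application's new bandwidth share."""
    if estimated_slowdown > bound:
        # Bound violated: allocate more bandwidth to the critical app.
        return min(1.0, share + STEP)
    # Bound met: release spare bandwidth to the other applications.
    return max(0.0, share - STEP)

# Example: an estimated slowdown of 3.5 against a bound of 3
# pushes the share up by one step.
print(adjust_share(0.50, estimated_slowdown=3.5, bound=3.0))  # 0.55
```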
A Look at One Workload

[Figure: slowdowns of leslie3d (QoS-critical) and hmmer, lbm, omnetpp (non-QoS-critical) under AlwaysPrioritize and MISE-QoS-10/1 through MISE-QoS-10/9, with slowdown bounds of 10, 3.33, and 2]

MISE-QoS is effective in
1. meeting the slowdown bound for the QoS-critical application
2. improving the performance of the non-QoS-critical applications

Effectiveness of MISE in Enforcing QoS

Across 3000 data points:

                         Predicted Met    Predicted Not Met
    QoS Bound Met            78.8%              2.1%
    QoS Bound Not Met         2.2%             16.9%

• MISE-QoS meets the bound for 80.9% of workloads
• MISE-QoS correctly predicts whether or not the bound is met for 95.7% of workloads
• AlwaysPrioritize meets the bound for 83% of workloads

Performance of Non-QoS-Critical Applications

[Figure: system performance of the non-QoS-critical applications under AlwaysPrioritize and MISE-QoS-10/1 through MISE-QoS-10/9]

• Higher performance when the bound is loose
• When the slowdown bound is 10/3, MISE-QoS improves system performance by 10%

Summary: Predictability in the Presence of Memory Bandwidth Interference

• Uncontrolled memory interference slows down applications unpredictably
• Goal: Estimate and control slowdowns
• Key contribution
– MISE: an accurate slowdown estimation model
– Average error of MISE: 8.2%
• Key idea
– Request service rate is a proxy for performance
• Leverage slowdown estimates to control slowdowns; many more applications exist

Taking Into Account Shared Cache Interference

[Figure: cores contending for the shared cache as well as main memory]

Revisiting Request Service Rates

With a shared cache in the hierarchy, the memory access rate into the shared cache and the request service rate at main memory are tightly coupled.

Estimating Cache and Memory Slowdowns Through Cache Access Rates

The Application Slowdown Model

    Slowdown = Cache Access Rate Alone / Cache Access Rate Shared

Real System Studies: Cache Access Rate vs. Slowdown

[Figure: slowdown vs. cache access rate ratio for astar, lbm, and bzip2 on a real system; the two track each other closely]

Challenge: How to estimate the alone cache access rate?

As with MISE, give the application highest priority, and add an auxiliary tag store next to the shared cache.

Auxiliary Tag Store

A block evicted from the shared cache by another application may still be in the auxiliary tag store, so an access that misses in the cache but hits in the auxiliary tag store is a contention miss; the auxiliary tag store tracks such contention misses.

Revisiting Request Service Rate Alone

• Revisiting the alone memory request service rate:

    Alone Request Service Rate of an Application = # Requests During High Priority Epochs / (# High Priority Cycles − # Interference Cycles)

• Cycles serving contention misses are not high-priority cycles

Cache Access Rate Alone

    Alone Cache Access Rate of an Application = # Requests During High Priority Epochs / (# High Priority Cycles − # Interference Cycles − # Cache Contention Cycles)

Cache Contention Cycles are the cycles spent serving contention misses:

    Cache Contention Cycles = # Contention Misses × Average Memory Service Time

The number of contention misses comes from the auxiliary tag store when the application is given high priority; the average memory service time is also measured when the application is given high priority.
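The two formulas above combine as in the sketch below, with illustrative counter names and values; in the scheme described, the contention miss count would come from the auxiliary tag store and the average memory service time would be measured, both while the application has high priority.

```python
def cache_access_rate_alone(hp_requests, hp_cycles, interference_cycles,
                            contention_misses, avg_mem_service_time):
    # Cycles spent serving contention misses do not count as
    # high-priority cycles, so they are subtracted as well.
    cache_contention_cycles = contention_misses * avg_mem_service_time
    return hp_requests / (hp_cycles - interference_cycles
                          - cache_contention_cycles)

def asm_slowdown(car_alone, car_shared):
    return car_alone / car_shared

# Hypothetical counters for one interval:
car_alone = cache_access_rate_alone(hp_requests=10_000,
                                    hp_cycles=600_000,
                                    interference_cycles=40_000,
                                    contention_misses=500,
                                    avg_mem_service_time=120)
print(round(asm_slowdown(car_alone, car_shared=0.008), 2))  # 2.5
```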
Application Slowdown Model (ASM)

    Slowdown = Cache Access Rate Alone / Cache Access Rate Shared

Previous Work on Slowdown Estimation

• As before: STFM [Mutlu et al., MICRO '07], FST [Ebrahimi et al., ASPLOS '10], and Per-thread Cycle Accounting [Du Bois et al., HiPEAC '13]
• Basic idea:

    Slowdown = Execution Time Alone / Execution Time Shared

Execution time shared is easy to measure; execution time alone is difficult, so these schemes count the number of cycles an application receives interference.

Model Accuracy Results

[Figure: slowdown estimation error (%) of FST, PTCA, and ASM for select applications (mcf, libquantum, milc, soplex, leslie3d, lbm, omnetpp, GemsFDTD, sphinx3, xalancbmk, gobmk, perlbench, sjeng, dealII, namd, tonto, povray, calculix, NPBbt, NPBft, NPBis, NPBua) and on average]

Average error of ASM's slowdown estimates: 10%

Leveraging Slowdown Estimates for Performance Optimization

• How do we leverage slowdown estimates from our model?
• To achieve high performance
– Slowdown-aware cache allocation
– Slowdown-aware bandwidth allocation
• To achieve performance predictability?

Cache Capacity Partitioning

Goal: Partition the shared cache among applications to mitigate contention.

[Figure: two cores sharing a cache organized as sets (Set 0 … Set N) and ways (Way 0 … Way 3), backed by main memory]

• Previous way partitioning schemes optimize for miss count
• Problem: they are not aware of performance and slowdowns

ASM-Cache: Slowdown-Aware Cache Way Partitioning

• Key requirement: slowdown estimates for all possible way partitions
• Extend ASM to estimate slowdown for all possible cache way allocations
• Key idea: allocate each way to the application whose slowdown reduces the most (see the greedy sketch below)
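A greedy sketch of the ASM-Cache idea: hand out ways one at a time, each to the application whose estimated slowdown would drop the most from one more way. Here `slowdown_if(app, n_ways)` is a hypothetical hook standing in for ASM extended to predict slowdown under each possible way allocation; the toy curves at the end are made up.

```python
def partition_ways(apps, num_ways, slowdown_if):
    """Greedily assign cache ways using predicted slowdowns."""
    ways = {app: 0 for app in apps}
    for _ in range(num_ways):
        # Slowdown reduction each application would see from one more way.
        gain = {app: slowdown_if(app, ways[app])
                     - slowdown_if(app, ways[app] + 1)
                for app in apps}
        winner = max(gain, key=gain.get)
        ways[winner] += 1
    return ways

# Toy model: mcf's slowdown falls steeply with extra ways,
# calculix's barely changes, so mcf should win most ways.
toy = {"mcf": lambda n: 3.0 / (1 + 0.5 * n),
       "calculix": lambda n: 1.2 - 0.01 * n}
print(partition_ways(["mcf", "calculix"], 8,
                     lambda app, n: toy[app](n)))
```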
Performance and Fairness Results

[Figure: performance and fairness (lower is better) of NoPart, UCP, and ASM-Cache on 4-, 8-, and 16-core systems]

Significant fairness benefits across different systems

Memory Bandwidth Partitioning

Goal: Partition the main memory bandwidth among applications to mitigate contention.

ASM-Mem: Slowdown-Aware Memory Bandwidth Partitioning

• Key idea: allocate high priority proportional to an application's slowdown:

    High Priority Fraction_i = Slowdown_i / Σ_j Slowdown_j

• Application i's requests are given highest priority at the memory controller for its fraction of the time

ASM-Mem: Fairness and Performance Results

[Figure: performance and fairness (lower is better) of FRFCFS, TCM, PARBS, and ASM-Mem on 4-, 8-, and 16-core systems]

Significant fairness benefits across different systems

Coordinated Resource Allocation Schemes

Cache-capacity-aware bandwidth allocation (sketched in code at the end of these notes):
1. Employ ASM-Cache to partition cache capacity
2. Drive ASM-Mem with slowdowns from ASM-Cache

Fairness and Performance Results

[Figure: fairness (lower is better) and performance of FRFCFS-NoPart, FRFCFS+UCP, TCM+UCP, PARBS+UCP, and ASM-Cache-Mem on a 16-core system with 1 and 2 memory channels]

Significant fairness benefits across different channel counts

Other Possible Applications

• VM migration and admission control schemes [VEE '15]
• Fair billing schemes in a commodity cloud
• Bounding application slowdowns

Summary: Predictability in the Presence of Shared Cache Interference

• Key ideas:
– Cache access rate is a proxy for performance
– Auxiliary tag stores and high priority can be combined to estimate slowdowns
• Key result: slowdown estimation error of ~10%
• Some applications:
– Slowdown-aware cache partitioning
– Slowdown-aware memory bandwidth partitioning
– Many more possible

Future Work: Coordinated Resource Management for Predictable Performance

Goal: Cache capacity and memory bandwidth allocation for an application to meet a bound

Challenges:
• Large search space of potential cache capacity and memory bandwidth allocations
• Multiple possible combinations of cache/memory allocations for each application
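Closing out, here is the coordinated scheme in miniature: take the per-application slowdown estimates that ASM-Cache produces under its chosen way partition and convert them into high-priority bandwidth fractions with the ASM-Mem rule. The application names and slowdown values are made up for illustration.

```python
def high_priority_fractions(slowdowns):
    # ASM-Mem rule: fraction_i = slowdown_i / sum_j slowdown_j,
    # i.e., more-slowed-down applications get a larger share of
    # highest-priority time at the memory controller.
    total = sum(slowdowns.values())
    return {app: s / total for app, s in slowdowns.items()}

# Hypothetical ASM-Cache slowdown estimates after way partitioning:
slowdowns = {"mcf": 3.0, "lbm": 2.0, "calculix": 1.0}
print(high_priority_fractions(slowdowns))
# {'mcf': 0.5, 'lbm': 0.333..., 'calculix': 0.166...}
```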