18-447 Computer Architecture Lecture 31: Predictable Performance

18-447
Computer Architecture
Lecture 31: Predictable Performance
Lavanya Subramanian
Carnegie Mellon University
Spring 2015, 4/15/2015
Shared Resource Interference
[Diagram: 16 cores contending for a shared cache and main memory]
High and Unpredictable Application Slowdowns
[Plot: slowdowns (0–6) of leslie3d (core 0) running with gcc (core 1) vs. running with mcf (core 1)]
1. High application slowdowns due to shared resource interference
2. An application's performance depends on which application it is running with

Need for Predictable Performance
•  There is a need for predictable performance
  –  When multiple applications share resources
  –  Especially if some applications require performance guarantees
•  Example 1: In virtualized systems
  –  Different users' jobs are consolidated onto the same server
  –  Need to meet the performance requirements of critical jobs in the presence of shared resources
•  Example 2: In mobile systems
  –  Interactive applications run with non-interactive applications
  –  Need to guarantee performance for interactive applications
Our Goal: Predictable performance

Tackling Different Parts of the Shared Memory Hierarchy
[Diagram: 16 cores contending for a shared cache and main memory]
Predictability in the Presence of Memory Bandwidth Interference
[Diagram: 16 cores contending for a shared cache and main memory]
Predictability in the Presence of Memory Bandwidth Interference (HPCA 2013)
1. Estimate Slowdown
   – Key Observations
   – Implementation
   – MISE Model: Putting it All Together
   – Evaluating the Model
2. Control Slowdown
   – Providing Soft Slowdown Guarantees

Slowdown: Definition

  Slowdown = Performance_Alone / Performance_Shared
Key Observation 1
For a memory-bound application, Performance ∝ Memory request service rate
[Plot: normalized performance vs. normalized request service rate for omnetpp, mcf, and astar; Intel Core i7, 4 cores]
Slowdown = Performance_Alone / Performance_Shared (difficult to measure)
         ≈ Request_Service_Rate_Alone / Request_Service_Rate_Shared (easy to measure)

Key Observation 2
Request Service Rate Alone (RSR_Alone) of an application can be estimated by giving the application highest priority at the memory controller.
Highest priority → little interference (almost as if the application were run alone)
[Diagram: request buffer state and service order when the application (1) runs alone, (2) runs with another application, and (3) runs with another application at highest priority]

Memory Interference-induced Slowdown Estimation (MISE) model for memory-bound applications:

  Slowdown = Request Service Rate Alone (RSR_Alone) / Request Service Rate Shared (RSR_Shared)
Key Observation 3
•  Memory-bound application
  [Timelines: a compute phase followed by a memory phase of requests; with interference, the memory phase stretches]
  Memory phase slowdown dominates overall slowdown
•  Non-memory-bound application
  [Timelines: a compute fraction (1 − α) and a memory fraction (α); with interference, only the memory fraction stretches]

Memory Interference-induced Slowdown Estimation (MISE) model for non-memory-bound applications:

  Slowdown = (1 − α) + α × (RSR_Alone / RSR_Shared)

Only the memory fraction (α) slows down with interference.
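To make the model concrete, the following is a minimal Python sketch of the MISE computation (the function name and example numbers are illustrative, not from the lecture). Setting α = 1 recovers the memory-bound special case above.

```python
def mise_slowdown(alpha: float, rsr_alone: float, rsr_shared: float) -> float:
    """MISE slowdown estimate.

    alpha:      fraction of time spent in the memory phase
    rsr_alone:  request service rate estimated at highest priority
    rsr_shared: request service rate measured while sharing memory
    """
    # Only the memory fraction (alpha) is stretched by interference;
    # the compute fraction (1 - alpha) is unaffected.
    return (1.0 - alpha) + alpha * (rsr_alone / rsr_shared)

# Memory-bound application (alpha = 1) whose service rate halves: 2x slowdown.
print(mise_slowdown(alpha=1.0, rsr_alone=0.8, rsr_shared=0.4))  # 2.0

# Mostly-compute application (alpha = 0.2) with the same rates slows down far less.
print(mise_slowdown(alpha=0.2, rsr_alone=0.8, rsr_shared=0.4))  # 1.2
```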
Interval Based Operation
[Timeline: execution is divided into intervals; during each interval, measure RSR_Shared and α; at the end of each interval, estimate RSR_Alone, then estimate slowdown]

Measuring RSR_Shared and α
•  Request Service Rate Shared (RSR_Shared)
  –  Per-core counter to track the number of requests serviced
  –  At the end of each interval, measure:
     RSR_Shared = Number of Requests Served / Interval Length
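In software form, this per-interval bookkeeping amounts to dividing two counters by the interval length; a small sketch with assumed counter names:

```python
def end_of_interval(requests_served: int, memory_stall_cycles: int,
                    interval_cycles: int) -> tuple[float, float]:
    """Reduce per-core hardware counters to the two MISE inputs."""
    rsr_shared = requests_served / interval_cycles   # RSR_Shared
    alpha = memory_stall_cycles / interval_cycles    # memory phase fraction
    return rsr_shared, alpha
```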
•  Memory Phase Fraction (α)
  –  Count the number of stall cycles at the core
  –  Compute the fraction of cycles stalled for memory

Estimating Request Service Rate Alone (RSR_Alone)
Goal: Estimate RSR_Alone. How: Periodically give each application the highest priority in accessing memory.
•  Divide each interval into shorter epochs
•  At the beginning of each epoch
  –  Randomly pick one application as the highest priority application
•  At the end of an interval, for each application, estimate:
   RSR_Alone = Number of Requests During High Priority Epochs / Number of Cycles Application Given High Priority
Inaccuracy in Estimating RSR_Alone
•  When an application has highest priority, it still experiences some interference
[Diagram: even at highest priority, the application's request can wait in the request buffer while another application's previously issued request finishes; these waiting cycles are interference cycles]

Accounting for Interference in RSR_Alone Estimation
•  Solution: Determine interference cycles and remove them from the ARSR (alone request service rate) calculation:

   RSR_Alone = Number of Requests During High Priority Epochs / (Number of Cycles Application Given High Priority − Interference Cycles)

•  A cycle is an interference cycle if
  –  a request from the highest priority application is waiting in the request buffer, and
  –  another application's request was issued previously
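A sketch of the epoch-based RSR_Alone estimation with the interference-cycle correction (the class and field names are assumptions for illustration, not the hardware's actual structures):

```python
import random
from dataclasses import dataclass

@dataclass
class AppCounters:
    hp_requests: int = 0          # requests served during this app's high priority epochs
    hp_cycles: int = 0            # cycles this app held highest priority
    interference_cycles: int = 0  # high priority cycles spent waiting behind
                                  # another application's previously issued request

def pick_epoch_winner(apps: list[AppCounters]) -> AppCounters:
    # At the start of each epoch, randomly pick the highest priority application.
    return random.choice(apps)

def rsr_alone(c: AppCounters) -> float:
    # Remove interference cycles so the estimate approaches the true alone rate.
    return c.hp_requests / (c.hp_cycles - c.interference_cycles)
```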
MISE Operation: Putting it All Together
[Timeline: in every interval, measure RSR_Shared and α, estimate RSR_Alone, and estimate slowdown]

Previous Work on Slowdown Estimation
•  Previous work on slowdown estimation
  –  STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO '07]
  –  FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS '10]
  –  Per-thread Cycle Accounting [Du Bois et al., HiPEAC '13]
•  Basic Idea:

   Slowdown = Stall Time Alone / Stall Time Shared

   (Stall Time Alone is difficult to estimate; Stall Time Shared is easy: count the number of cycles the application receives interference)

Two Major Advantages of MISE Over STFM
•  Advantage 1:
  –  STFM estimates alone performance while an application is receiving interference → difficult
  –  MISE estimates alone performance while giving an application the highest priority → easier
•  Advantage 2:
  –  STFM does not take the compute phase into account for non-memory-bound applications
  –  MISE accounts for the compute phase → better accuracy

Methodology
•  Configuration of our simulated system
  –  4 cores
  –  1 channel, 8 banks/channel
  –  DDR3-1066 DRAM
  –  512 KB private cache/core
•  Workloads
  –  SPEC CPU2006
  –  300 multiprogrammed workloads

Quantitative Comparison
[Plot: actual, STFM-estimated, and MISE-estimated slowdowns of the SPEC CPU2006 application leslie3d over 100 million cycles]
Comparison to STFM
[Plots: actual vs. estimated slowdowns over 100 million cycles for cactusADM, GemsFDTD, soplex, wrf, calculix, and povray]
Average error of MISE: 8.2%
Average error of STFM: 29.4%
(across 300 workloads)
Possible Use Cases
•  Bounding application slowdowns [HPCA '14]
•  VM migration and admission control schemes [VEE '15]
•  Fair billing schemes in a commodity cloud

MISE-QoS: Providing "Soft" Slowdown Guarantees
•  Goal
  1. Ensure QoS-critical applications meet a prescribed slowdown bound
  2. Maximize system performance for other applications
•  Basic Idea
  –  Allocate just enough bandwidth to the QoS-critical application
  –  Assign the remaining bandwidth to other applications

Methodology
•  Each application (25 applications in total) is considered as the QoS-critical application
•  Run with 12 sets of co-runners of different memory intensities
•  Total of 300 multiprogrammed workloads
•  Each workload run with 10 slowdown bound values
•  Baseline memory scheduling mechanism
  –  Always prioritize the QoS-critical application [Iyer et al., SIGMETRICS 2007]
  –  Other applications' requests scheduled in FR-FCFS order [Zuravleff and Robinson, US Patent 1997; Rixner et al., ISCA 2000]

A Look at One Workload
[Bar charts: slowdowns of leslie3d (QoS-critical) and hmmer, lbm, omnetpp (non-QoS-critical) under AlwaysPrioritize and MISE-QoS-10/1 through MISE-QoS-10/9, for slowdown bounds of 10, 3.33, and 2]
MISE is effective in
1. meeting the slowdown bound for the QoS-critical application
2. improving the performance of non-QoS-critical applications
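One plausible reading of the MISE-QoS idea is a per-interval feedback loop on the QoS-critical application's bandwidth share; the sketch below is an assumption-laden illustration (the step size and names are invented), not the paper's exact controller.

```python
def adjust_qos_share(estimated_slowdown: float, slowdown_bound: float,
                     share: float, step: float = 0.05) -> float:
    """Grow the QoS-critical application's bandwidth share while it misses
    its bound; shrink it when there is slack, freeing bandwidth for others."""
    if estimated_slowdown > slowdown_bound:
        share += step  # bound missed: allocate more bandwidth
    else:
        share -= step  # bound met: release bandwidth to other applications
    return min(max(share, 0.0), 1.0)
```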
Effectiveness of MISE in Enforcing QoS
Across 3000 data points:

                      Predicted Met    Predicted Not Met
  QoS Bound Met           78.8%              2.1%
  QoS Bound Not Met        2.2%             16.9%

MISE-QoS meets the bound for 80.9% of workloads.
MISE-QoS correctly predicts whether or not the bound is met for 95.7% of workloads.
AlwaysPrioritize meets the bound for 83% of workloads.

Performance of Non-QoS-Critical Applications
[Bar chart: system performance under AlwaysPrioritize and MISE-QoS-10/1 through MISE-QoS-10/9]
Higher performance when the bound is loose. When the slowdown bound is 10/3, MISE-QoS improves system performance by 10%.

Summary: Predictability in the Presence of Memory Bandwidth Interference
•  Uncontrolled memory interference slows down applications unpredictably
•  Goal: Estimate and control slowdowns
•  Key contribution
  –  MISE: an accurate slowdown estimation model
  –  Average error of MISE: 8.2%
•  Key Idea
  –  Request service rate is a proxy for performance
•  Leverage slowdown estimates to control slowdowns; many more applications exist

Taking Into Account Shared Cache Interference
[Diagram: 16 cores contending for a shared cache and main memory]
Revisiting Request Service Rates
[Diagram: 16 cores; requests flow through the shared cache to main memory at a memory access rate and are serviced at a request service rate]
Request service and access rates are tightly coupled.

Estimating Cache and Memory Slowdowns Through Cache Access Rates
[Diagram: 16 cores access the shared cache at the cache access rate; the shared cache accesses main memory]
The Application Slowdown Model
[Diagram: 16 cores access the shared cache at the cache access rate]

  Slowdown = Cache Access Rate_Alone / Cache Access Rate_Shared
Real System Studies: Cache Access Rate vs. Slowdown
[Scatter plot: slowdown vs. cache access rate ratio for astar, lbm, and bzip2; the two track each other closely]

Challenge
How to estimate the alone cache access rate?
[Diagram: 16 cores share the cache; an auxiliary tag store sits alongside the shared cache, and one application is given priority]
Auxiliary Tag Store
[Diagram: two cores share the cache, each with an auxiliary tag store; a block evicted by another application's interference misses in the shared cache but is still in the auxiliary tag store]
The auxiliary tag store tracks such contention misses.
Revisiting Request Service Rate Alone
•  Revisiting the alone memory request service rate:

   RSR_Alone = # Requests During High Priority Epochs / (# High Priority Cycles − # Interference Cycles)
Cycles serving contention misses are not high priority cycles.

Cache Access Rate Alone

   CAR_Alone = # Requests During High Priority Epochs / (# High Priority Cycles − # Interference Cycles − # Cache Contention Cycles)

Cache Contention Cycles: cycles spent serving contention misses:

   Cache Contention Cycles = # Contention Misses × Average Memory Service Time

(# Contention Misses comes from the auxiliary tag store when the application is given high priority; Average Memory Service Time is measured when the application is given high priority)
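Pulling these pieces together, a minimal sketch (with illustrative names) of the alone cache access rate and the resulting ASM slowdown estimate:

```python
def car_alone(hp_requests: int, hp_cycles: int, interference_cycles: int,
              contention_misses: int, avg_mem_service_time: float) -> float:
    # Cycles spent serving contention misses (counted via the auxiliary
    # tag store) are not genuine high priority cycles, so subtract them too.
    contention_cycles = contention_misses * avg_mem_service_time
    return hp_requests / (hp_cycles - interference_cycles - contention_cycles)

def asm_slowdown(car_alone_estimate: float, car_shared: float) -> float:
    return car_alone_estimate / car_shared
```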
Application Slowdown Model (ASM)
[Diagram: 16 cores access the shared cache at the cache access rate]

  Slowdown = Cache Access Rate_Alone / Cache Access Rate_Shared
Previous Work on Slowdown Estimation
•  Previous work on slowdown estimation
  –  STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO '07]
  –  FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS '10]
  –  Per-thread Cycle Accounting [Du Bois et al., HiPEAC '13]
•  Basic Idea:

   Slowdown = Execution Time Alone / Execution Time Shared

   (Execution Time Alone is difficult to estimate; Execution Time Shared is easy: count the number of cycles the application receives interference)

Model Accuracy Results
[Bar chart: slowdown estimation error (%) of FST, PTCA, and ASM for select applications — calculix, povray, tonto, namd, dealII, sjeng, perlbench, gobmk, xalancbmk, sphinx3, GemsFDTD, omnetpp, lbm, leslie3d, soplex, milc, libquantum, mcf, and NAS benchmarks (bt, ft, is, ua) — and on average]
Average error of ASM's slowdown estimates: 10%

Leveraging Slowdown Estimates for Performance Optimization
•  How do we leverage slowdown estimates from our model?
•  To achieve high performance
  –  Slowdown-aware cache allocation
  –  Slowdown-aware bandwidth allocation
•  To achieve performance predictability?

Cache Capacity Partitioning
[Diagram: two cores access the shared cache and main memory at the cache access rate]
Goal: Partition the shared cache among applications to mitigate contention

Cache Capacity Partitioning
[Diagram: cache ways 0–3 across sets 0–N, with ways allocated to different cores]
Previous way partitioning schemes optimize for miss count.
Problem: They are not aware of performance and slowdowns.

ASM-Cache: Slowdown-aware Cache Way Partitioning
•  Key Requirement: Slowdown estimates for all possible way partitions
•  Extend ASM to estimate slowdown for all possible cache way allocations
•  Key Idea: Allocate each way to the application whose slowdown reduces the most (see the sketch below)
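A sketch of that greedy allocation; `slowdown_with_ways` stands in for ASM's per-partition slowdown estimates and is an assumed interface:

```python
def asm_cache_partition(num_apps: int, num_ways: int, slowdown_with_ways) -> list[int]:
    """slowdown_with_ways(app, ways) -> ASM's estimated slowdown of `app`
    when it is allocated `ways` cache ways."""
    ways = [0] * num_apps
    for _ in range(num_ways):
        # Give the next way to the application whose slowdown would drop the most.
        best = max(range(num_apps),
                   key=lambda a: slowdown_with_ways(a, ways[a])
                               - slowdown_with_ways(a, ways[a] + 1))
        ways[best] += 1
    return ways
```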
Performance and Fairness Results
[Bar charts: performance and fairness (lower is better) of NoPart, UCP, and ASM-Cache on 4-, 8-, and 16-core systems]
Significant fairness benefits across different systems

Memory Bandwidth Partitioning
[Diagram: two cores access the shared cache and main memory at the cache access rate]
Goal: Partition the main memory bandwidth among applications to mitigate contention

ASM-Mem: Slowdown-aware Memory Bandwidth Partitioning
•  Key Idea: Allocate high priority proportional to an application's slowdown:

   High Priority Fraction_i = Slowdown_i / Σ_j Slowdown_j
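The allocation rule in code form; this is a direct transcription of the formula above (slowdowns are ≥ 1, so the denominator is never zero):

```python
def high_priority_fractions(slowdowns: list[float]) -> list[float]:
    # Each application receives high priority for a fraction of time
    # proportional to its estimated slowdown.
    total = sum(slowdowns)
    return [s / total for s in slowdowns]

# e.g., slowdowns of [3.0, 1.0] yield fractions of [0.75, 0.25]
```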
•  Application i's requests are given highest priority at the memory controller for its fraction of the time

ASM-Mem: Fairness and Performance Results
[Bar charts: performance and fairness (lower is better) of FRFCFS, TCM, PARBS, and ASM-Mem on 4-, 8-, and 16-core systems]
Significant fairness benefits across different systems

Coordinated Resource Allocation Schemes
[Diagram: 16 cores, shared cache, main memory; cache capacity-aware bandwidth allocation]
1. Employ ASM-Cache to partition cache capacity
2. Drive ASM-Mem with slowdowns from ASM-Cache
(a sketch of this coordination follows)
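A sketch of the coordination, reusing the `asm_cache_partition` and `high_priority_fractions` sketches from earlier; the composition, like those sketches, is illustrative rather than the lecture's exact mechanism:

```python
def coordinated_allocation(num_apps: int, num_ways: int, slowdown_with_ways):
    # 1. Employ ASM-Cache to partition cache capacity.
    ways = asm_cache_partition(num_apps, num_ways, slowdown_with_ways)
    # 2. Drive ASM-Mem with the slowdowns ASM-Cache predicts for that partition.
    slowdowns = [slowdown_with_ways(a, ways[a]) for a in range(num_apps)]
    return ways, high_priority_fractions(slowdowns)
```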
Fairness and Performance Results
[Bar charts: 16-core system; performance and fairness (lower is better) of FRFCFS-NoPart, FRFCFS+UCP, TCM+UCP, PARBS+UCP, and ASM-Cache-Mem with 1 and 2 memory channels]
Significant fairness benefits across different channel counts

Other Possible Applications
•  VM migration and admission control schemes [VEE '15]
•  Fair billing schemes in a commodity cloud
•  Bounding application slowdowns

Summary: Predictability in the Presence of Shared Cache Interference
•  Key Ideas:
  –  Cache access rate is a proxy for performance
  –  Auxiliary tag stores and high priority can be combined to estimate slowdowns
•  Key Result: Slowdown estimation error of ~10%
•  Some Applications:
  –  Slowdown-aware cache partitioning
  –  Slowdown-aware memory bandwidth partitioning
  –  Many more possible

Future Work: Coordinated Resource Management for Predictable Performance
Goal: Cache capacity and memory bandwidth allocation for an application to meet a slowdown bound
Challenges:
•  Large search space of potential cache capacity and memory bandwidth allocations
•  Multiple possible combinations of cache/memory allocations for each application