Providing High and Predictable Performance
in Multicore Systems
Through Shared Resource Management
Lavanya Subramanian
Shared Resource Interference

[Figure: Sixteen cores contending for a shared cache and main memory]
High and Unpredictable Application Slowdowns

[Figure: Slowdown of leslie3d (core 0) running with gcc (core 1) vs. running with mcf (core 1)]

An application's slowdown depends on which application it is running with: shared resource interference causes high and unpredictable application slowdowns.
Outline

Goals:
1. High performance
2. Predictable performance

• Blacklisting memory scheduler
• Predictability with memory interference
Background: Main Memory

[Figure: A memory controller issues requests over a channel to four banks (Bank 0–3); each bank is organized as rows and columns with a per-bank row buffer, and an access is a row-buffer hit or a row-buffer miss]

• FR-FCFS Memory Scheduler [Zuravleff and Robinson, US Patent ‘97; Rixner et al., ISCA ‘00] (selection rule sketched below)
  – Row-buffer hit first
  – Older request first
• Unaware of inter-application interference
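To make the FR-FCFS rules above concrete, here is a minimal per-bank selection sketch in Python. This is a generic, illustrative rendering of the two bullets, with names of our choosing; real controllers implement this in hardware, one instance per bank.

```python
# Generic sketch of FR-FCFS request selection (illustrative, not taken
# from any real controller): oldest row-buffer hit first, else oldest request.
def frfcfs_pick(request_queue, open_row):
    """request_queue: list of (arrival_time, row) tuples for one bank."""
    hits = [r for r in request_queue if r[1] == open_row]  # row-buffer hits
    if hits:
        return min(hits, key=lambda r: r[0])       # 1. oldest hit first
    return min(request_queue, key=lambda r: r[0])  # 2. otherwise oldest request
```

Note that nothing in this rule looks at which application a request belongs to, which is exactly the "unaware of inter-application interference" problem.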
Tackling Inter-Application Interference: Memory Request Scheduling

• Monitor application memory access characteristics
• Rank applications based on memory access characteristics
• Prioritize requests at the memory controller, based on ranking
An Example: Thread Cluster Memory Scheduling

[Figure: Threads in the system are split into a memory-non-intensive cluster and a memory-intensive cluster; the non-intensive cluster receives higher priority for throughput, while priorities within the intensive cluster are shuffled for fairness, with one thread prioritized at a time. Figure: Kim et al., MICRO 2010]
Problems with Previous Application-Aware Memory Schedulers

• Hardware complexity
  – Ranking incurs high hardware cost
• Unfair slowdowns of some applications
  – Ranking causes unfairness
High Hardware Complexity

• Ranking incurs high hardware cost
  – Rank computation incurs logic/storage cost
  – Rank enforcement requires comparison logic

[Figure: Scheduler latency (in ns) and area (in square um) of FRFCFS vs. TCM, synthesized with a 32nm standard cell library; TCM has 8x the latency and 1.8x the area of FRFCFS]

Avoid ranking to achieve low hardware cost
Ranking Causes Unfair Application Slowdowns

• Lower-ranked applications experience significant slowdowns
  – Low memory service causes slowdown
  – Periodic rank shuffling is not sufficient

[Figure: Number of requests served over execution time (in 1000s of cycles), and the resulting slowdowns, for astar and soplex under TCM vs. Grouping]

Grouping offers lower unfairness than ranking
Problems with Previous
Application-Aware Memory Schedulers
• Hardware Complexity
– Ranking incurs high hardware cost
• Unfair slowdowns of some applications
– Ranking causes unfairness
Our Goal: Design a memory scheduler with
Low Complexity, High Performance, and Fairness
Towards a New Scheduler Design

1. Simple Grouping Mechanism
   • Monitor applications that have a number of consecutive requests served
   • Blacklist such applications
2. Enforcing Priorities Based On Grouping
   • Prioritize requests of non-blacklisted applications
   • Periodically clear blacklists
Methodology

• Configuration of our simulated system
  – 24 cores
  – 4 channels, 8 banks/channel
  – DDR3 1066 DRAM
  – 512 KB private cache/core
• Workloads
  – SPEC CPU2006, TPC-C, Matlab
  – 80 multiprogrammed workloads
Metrics

• System Performance:

  Harmonic Speedup = N / Σ_i (IPC_i alone / IPC_i shared)

  Weighted Speedup = Σ_i (IPC_i shared / IPC_i alone)

• Fairness:

  Maximum Slowdown = max_i (IPC_i alone / IPC_i shared)

where Slowdown_i = IPC_i alone / IPC_i shared (a small code sketch of these metrics follows).
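As a concrete illustration of the three metrics above, here is a minimal Python sketch; the function name and the example IPC values are ours, not from the talk.

```python
# Hypothetical helper computing the metrics above from per-application
# alone/shared IPCs; the example values below are made up.
def metrics(ipc_alone, ipc_shared):
    slowdowns = [a / s for a, s in zip(ipc_alone, ipc_shared)]
    harmonic_speedup = len(ipc_alone) / sum(slowdowns)   # system performance
    weighted_speedup = sum(s / a for a, s in zip(ipc_alone, ipc_shared))
    maximum_slowdown = max(slowdowns)                    # fairness
    return harmonic_speedup, weighted_speedup, maximum_slowdown

print(metrics([2.0, 1.5], [1.0, 1.2]))  # -> (~0.615, 1.3, 2.0)
```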
Previous Memory Schedulers
• FR-FCFS [Zuravleff and Robinson, US Patent 1997, Rixner et al., ISCA 2000]
– Prioritizes row-buffer hits and older requests
– Application-unaware
• PARBS [Mutlu and Moscibroda, ISCA 2008]
– Batches oldest requests from each application; prioritizes batch
– Employs ranking within a batch
• ATLAS [Kim et al., HPCA 2010]
– Prioritizes applications with low memory-intensity
• TCM [Kim et al., MICRO 2010]
– Always prioritizes low memory-intensity applications
– Shuffles request priorities of high memory-intensity applications
Performance Results

[Figure: Weighted speedup and maximum slowdown of FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting]

Blacklisting achieves 5% higher system performance and 25% lower maximum slowdown than TCM, and approaches the fairness of PARBS and FRFCFS-Cap while achieving better performance.
Complexity Results

[Figure: Scheduler latency (in ns) and area (in square um) of FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting]

Blacklisting achieves 43% lower area and 70% lower latency than TCM
Outline

Goals:
1. High performance
2. Predictable performance

• Blacklisting memory scheduler
• Predictability with memory interference
Need for Predictable Performance

• There is a need for predictable performance
  – When multiple applications share resources
  – Especially if some applications require performance guarantees
• Example 1: In server systems
  – Different users' jobs are consolidated onto the same server
  – Need to provide bounded slowdowns to critical jobs
• Example 2: In mobile systems
  – Interactive applications run with non-interactive applications
  – Need to guarantee performance for interactive applications

As a first step: predictable performance in the presence of memory interference
Outline

Goals:
1. High performance
2. Predictable performance

• Blacklisting memory scheduler
• Predictability with memory interference
Predictability in the Presence of Memory Interference

1. Estimate Slowdown
   – Key Observations
   – MISE Operation: Putting it All Together
   – Evaluating the Model
2. Control Slowdown
   – Providing Soft Slowdown Guarantees
   – Minimizing Maximum Slowdown
Slowdown: Definition

Slowdown = Performance Alone / Performance Shared
Key Observation 1

For a memory-bound application, Performance ∝ Memory request service rate

[Figure: Normalized performance vs. normalized request service rate for omnetpp, mcf, and astar on an Intel Core i7 (4 cores, 8.5 GB/s memory bandwidth); the two quantities track each other closely]

Slowdown = Performance Alone / Performance Shared (difficult to measure)
         ≈ Request Service Rate Alone / Request Service Rate Shared (easier to measure)
Key Observation 2

Request Service Rate Alone (RSRAlone) of an application can be estimated by giving the application highest priority in accessing memory

Highest priority → Little interference (almost as if the application were run alone)
Key Observation 2

[Figure: Request buffer state and service order in three cases: (1) the application runs alone, (2) it runs with another application, and (3) it runs with another application but is given highest priority; with highest priority, its requests are served in nearly the same number of time units as when run alone]
Memory Interference-induced Slowdown Estimation (MISE) model for memory-bound applications:

Slowdown = Request Service Rate Alone (RSRAlone) / Request Service Rate Shared (RSRShared)
Key Observation 3

• Memory-bound application

[Figure: Execution alternates between a compute phase and a memory phase; with interference, the requests in the memory phase take longer to be served, while the compute phase is unaffected]

Memory phase slowdown dominates overall slowdown
Key Observation 3

• Non-memory-bound application

[Figure: A non-memory-bound application spends a fraction (1 - α) of its time in the compute phase and a fraction α in the memory phase; only the memory phase stretches under interference]

Only the memory fraction (α) slows down with interference.

Memory Interference-induced Slowdown Estimation (MISE) model for non-memory-bound applications (sketched in code below):

Slowdown = (1 - α) + α × (RSRAlone / RSRShared)
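A minimal Python sketch of the two MISE estimates above; the function name is ours, not from the talk.

```python
# Sketch of the MISE slowdown models above (naming is ours).
def mise_slowdown(rsr_alone, rsr_shared, alpha=1.0):
    """alpha = fraction of time in the memory phase; alpha = 1 recovers
    the memory-bound model Slowdown = RSRAlone / RSRShared."""
    return (1 - alpha) + alpha * (rsr_alone / rsr_shared)

print(mise_slowdown(2.0, 1.0, alpha=0.6))  # -> 1.6
```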
Predictability in the Presence of Memory Interference

1. Estimate Slowdown
   – Key Observations
   – MISE Operation: Putting it All Together
   – Evaluating the Model
2. Control Slowdown
   – Providing Soft Slowdown Guarantees
   – Minimizing Maximum Slowdown
MISE Operation: Putting it All Together

[Figure: Time is divided into intervals; during each interval, measure RSRShared and α and estimate RSRAlone, then estimate slowdown at the end of the interval; repeat every interval]
Predictability in the Presence of Memory Interference

1. Estimate Slowdown
   – Key Observations
   – MISE Operation: Putting it All Together
   – Evaluating the Model
2. Control Slowdown
   – Providing Soft Slowdown Guarantees
   – Minimizing Maximum Slowdown
Previous Work on Slowdown Estimation

• Previous work on slowdown estimation
  – STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO ‘07]
  – FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS ‘10]
• Basic Idea:

  Slowdown = Stall Time Alone / Stall Time Shared

  – Stall Time Shared is easy to measure; Stall Time Alone is difficult, and is estimated by counting the number of cycles an application receives interference
Two Major Advantages of MISE Over STFM

• Advantage 1:
  – STFM estimates alone performance while an application is receiving interference → difficult
  – MISE estimates alone performance while giving an application the highest priority → easier
• Advantage 2:
  – STFM does not take into account the compute phase for non-memory-bound applications
  – MISE accounts for the compute phase → better accuracy
Methodology

• Configuration of our simulated system
  – 4 cores
  – 1 channel, 8 banks/channel
  – DDR3 1066 DRAM
  – 512 KB private cache/core
• Workloads
  – SPEC CPU2006
  – 300 multiprogrammed workloads
Quantitative Comparison

[Figure: Actual slowdown and the STFM and MISE estimates for the SPEC CPU2006 application leslie3d over 100 million cycles]
Comparison to STFM

[Figure: Actual vs. estimated slowdowns over time for cactusADM, GemsFDTD, soplex, wrf, calculix, and povray]

Average error of MISE: 8.2%
Average error of STFM: 29.4%
(across 300 workloads)
Predictability in the Presence of Memory Interference

1. Estimate Slowdown
   – Key Observations
   – MISE Operation: Putting it All Together
   – Evaluating the Model
2. Control Slowdown
   – Providing Soft Slowdown Guarantees
   – Minimizing Maximum Slowdown
MISE-QoS: Providing “Soft” Slowdown Guarantees

• Goal
  1. Ensure QoS-critical applications meet a prescribed slowdown bound
  2. Maximize system performance for other applications
• Basic Idea
  – Allocate just enough bandwidth to the QoS-critical application
  – Assign remaining bandwidth to other applications
Outline

Goals:
1. High performance
2. Predictable performance

• Blacklisting memory scheduler
• Predictability with memory interference
A Recap

• Problem: Shared resource interference causes high and unpredictable application slowdowns
• Approach:
  – Simple mechanisms to mitigate interference
  – Slowdown estimation models
  – Slowdown control mechanisms
• Future Work:
  – Extending to shared caches
Shared Cache Interference

[Figure: Sixteen cores contending for a shared cache in front of main memory]
Impact of Cache Capacity Contention

[Figure: Slowdowns of bzip2 (core 0) and soplex (core 1) when sharing only main memory vs. sharing both main memory and caches]

Cache capacity interference causes high application slowdowns
Backup Slides
Outline

Goals:
1. High performance
2. Predictable performance

• Blacklisting memory scheduler
• Predictability with memory interference
• Coordinated cache/memory management for performance
• Cache slowdown estimation
• Coordinated cache/memory management for predictability
Request Service vs. Memory Access

[Figure: Sixteen cores; the memory access rate is measured at the shared cache, and the request service rate at main memory]

Request service and access rates are tightly coupled
Estimating Cache and Memory Slowdowns Through Cache Access Rates

[Figure: Sixteen cores; the cache access rate is measured at the shared cache, in front of main memory]

Slowdown = Cache Access Rate Alone / Cache Access Rate Shared
Cache Access Rate vs. Slowdown

[Figure: Slowdown vs. cache access rate ratio for bzip2 and xalancbmk; slowdown grows with the cache access rate ratio]
Challenge

How to estimate the alone cache access rate?

[Figure: Sixteen cores accessing the shared cache; auxiliary tags at the shared cache and prioritization at main memory are used to estimate the alone cache access rate]
Outline

Goals:
1. High performance
2. Predictable performance

• Blacklisting memory scheduler
• Predictability with memory interference
• Coordinated cache/memory management for performance
• Cache slowdown estimation
• Coordinated cache/memory management for predictability
Leveraging Slowdown Estimates for Performance Optimization

• How do we leverage slowdown estimates to achieve high performance when allocating
  – Memory bandwidth?
  – Cache capacity?
• Leverage other metrics along with slowdowns
  – Memory intensity
  – Cache miss rates
Coordinated Resource Allocation Schemes

[Figure: Sixteen cores, a shared cache, and main memory; cache capacity-aware bandwidth allocation at the memory controller is coordinated with bandwidth-aware cache capacity allocation at the shared cache]
Outline

Goals:
1. High performance
2. Predictable performance

• Blacklisting memory scheduler
• Predictability with memory interference
• Coordinated cache/memory management for performance
• Cache slowdown estimation
• Coordinated cache/memory management for predictability
Coordinated Resource Management Schemes for Predictable Performance

Goal: Cache capacity and memory bandwidth allocation for an application to meet a bound

Challenges:
• Large search space of potential cache capacity and memory bandwidth allocations
• Multiple possible combinations of cache/memory allocations for each application
Outline

Goals:
1. High performance
2. Predictable performance

• Blacklisting memory scheduler
• Predictability with memory interference
• Coordinated cache/memory management for performance
• Cache slowdown estimation
• Coordinated cache/memory management for predictability
Timeline

[Figure: Gantt chart spanning Apr. 2014 through May 2015]

• Cache slowdown estimation (75% goal)
• Coordinated cache/memory management for performance (100% goal)
• Coordinated cache/memory management for predictability (125% goal)
• Write up thesis and defend
Summary

• Problem: Shared resource interference causes high and unpredictable application slowdowns
• Goals: High and predictable performance
• Our Approach:
  – Simple mechanisms to mitigate interference
  – Slowdown estimation models
  – Coordinated cache/memory management
Thank You!
Backup Slides
Prior Work: Cache and Memory QoS
• Resource QoS (SIGMETRICS 2007, ICS 2009, TACO 2010)
• Bubble up (MICRO 2011)
• Cache capacity QoS (ASPLOS 2014)
• Provide QoS on resource allocations
• Complementary to our slowdown-based mechanisms
Prior Work: Memory Interference Mitigation
• Source throttling (ASPLOS 2010)
– Throttle interfering applications at the core
• Memory interleaving (MICRO 2011)
– Map data to banks to exploit parallelism and row locality

Complementary to our proposals
Prior Work: Shared Cache Management
• Cache replacement policies (PACT 2008, ISCA 2010)
• Insertion policies (ACSAC 2007, MICRO 2011)
• Partitioning (ICS 2004, MICRO 2006, ASPLOS 2014)
• Focused on a single resource: shared cache
• Goal of these policies: Improve performance
Complementary to our slowdown-based mechanisms
Prior Work: DRAM Optimizations
Related Work

• Main memory interference mitigation
• Slowdown estimation
• Shared cache capacity management
• Cache and memory QoS
Main Memory Interference Mitigation
• Source throttling (Ebrahimi et al., ASPLOS 2010)
– Throttle interfering applications at the core
• Memory interleaving (Kaseridis et al., MICRO 2011)
– Map data to banks to exploit parallelism and row locality
Blacklisting
Blacklisting Scheduler Mechanism

• At each channel,
  – Count the number of consecutive requests served from an application
• The counter is cleared
  – when a request belonging to a different application than the previous one is served
• When the count equals N
  – clear the counter
  – blacklist the application
• Periodically clear blacklists

(See the code sketch below.)
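To make this mechanism concrete, here is a minimal Python sketch of the counting and blacklisting rules above; the class name, method names, and the default threshold are ours, since the talk specifies only the rules themselves.

```python
# Minimal sketch of the blacklisting logic described above.
# Class/parameter names are illustrative, not from the talk.
class BlacklistingChannel:
    def __init__(self, threshold_n=4):
        self.threshold_n = threshold_n   # "N" consecutive requests
        self.last_app = None
        self.streak = 0
        self.blacklist = set()

    def on_request_served(self, app_id):
        # Counter is cleared when the served request belongs to a
        # different application than the previous one.
        if app_id != self.last_app:
            self.last_app = app_id
            self.streak = 0
        self.streak += 1
        # When the count equals N: clear the counter, blacklist the app.
        if self.streak == self.threshold_n:
            self.blacklist.add(app_id)
            self.streak = 0

    def clear_blacklist(self):
        # Invoked periodically (the clearing interval is a design parameter).
        self.blacklist.clear()

    def priority(self, app_id):
        # Requests of non-blacklisted applications are prioritized.
        return 0 if app_id in self.blacklist else 1
```

Because the only state is one counter, one application ID, and a bit per application, this avoids the per-application ranking logic that makes schedulers like TCM costly.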
Complexity Results

[Figure: Scheduler latency (in ns) of FRFCFS, FRFCFS-Cap, PARBS, ATLAS, TCM, and Blacklisting, compared against the DDR3-800, DDR3-1333, and DDR4-3200 clock periods, and scheduler area (in square um) for the same schedulers]
Understanding Why Blacklisting Works

[Figure: Fraction of requests vs. streak length under FRFCFS, PARBS, TCM, and Blacklisting, for libquantum (a high memory-intensity application) and calculix (a low memory-intensity application)]

Blacklisting shifts the request distribution towards the left (shorter streaks) for the high memory-intensity application and towards the right (longer streaks) for the low memory-intensity application
Effect of Workload Memory Intensity

Combining FRFCFS-Cap and Blacklisting

Sensitivity to Blacklisting Threshold

Sensitivity to Clearing Interval

Sensitivity to Core Count

Sensitivity to Channel Count

TCM vs. Grouping

Comparison with TCM-like Clustering

BLISS vs. Criticality-aware Scheduling

Sub-row Interleaving

MISE
Measuring RSRShared and α

• Request Service Rate Shared (RSRShared)
  – Per-core counter to track the number of requests serviced
  – At the end of each interval, measure

    RSRShared = Number of Requests Serviced / Interval Length

• Memory Phase Fraction (α)
  – Count the number of stall cycles at the core
  – Compute the fraction of cycles stalled for memory
Estimating Request Service Rate Alone (RSRAlone)

Goal: Estimate RSRAlone
How: Periodically give each application the highest priority in accessing memory

• Divide each interval into shorter epochs
• At the beginning of each epoch
  – The memory controller randomly picks an application as the highest priority application
• At the end of an interval, for each application, estimate (bookkeeping sketched in code below)

  RSRAlone = Number of Requests During High Priority Epochs / Number of Cycles Application Given High Priority
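A minimal sketch of the interval/epoch bookkeeping described above; all names are ours, and real hardware would implement this with per-core counters rather than dictionaries.

```python
import random

# Sketch of MISE's epoch-based RSRAlone estimation (naming is ours).
class MiseEstimator:
    def __init__(self, apps):
        self.apps = apps
        self.high_prio_requests = {a: 0 for a in apps}
        self.high_prio_cycles = {a: 0 for a in apps}
        self.current_high_prio = None

    def begin_epoch(self):
        # The memory controller randomly picks the highest-priority app.
        self.current_high_prio = random.choice(self.apps)

    def tick(self):
        # One cycle of high priority for the currently chosen application.
        self.high_prio_cycles[self.current_high_prio] += 1

    def on_request_served(self, app):
        if app == self.current_high_prio:
            self.high_prio_requests[app] += 1

    def rsr_alone(self, app):
        # Requests served during the app's high-priority epochs, divided
        # by the cycles it was given highest priority.
        return self.high_prio_requests[app] / max(1, self.high_prio_cycles[app])
```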
Inaccuracy in Estimating RSRAlone
Request
Bufferan
• When
State
Service order
Time units
application
has highest
priority
3
2
– Still experiences
Main some interference
1
Memory
Request Buffer
State
Time units
3
Main
Memory
Time units
Request Buffer
State
3
Main
Memory
Time units
3
High Priority
Main
Memory
Service order
2
1
Main
Memory
Service order
2
1
Main
Memory
Service order
2
Interference Cycles
1
Main
Memory
85
Accounting for Interference in RSRAlone Estimation

• Solution: Determine and remove interference cycles from the RSRAlone calculation (see the sketch below)

  RSRAlone = Number of Requests During High Priority Epochs / (Number of Cycles Application Given High Priority - Interference Cycles)

• A cycle is an interference cycle if
  – a request from the highest priority application is waiting in the request buffer, and
  – another application's request was issued previously
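Continuing the sketch above, the correction amounts to subtracting an interference-cycle counter from the denominator (again, the names are ours):

```python
# Correction to the earlier sketch: subtract interference cycles.
def rsr_alone_corrected(requests, prio_cycles, interference_cycles):
    return requests / max(1, prio_cycles - interference_cycles)
```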
Other Results in the Paper
• Sensitivity to model parameters
– Robust across different values of model parameters
• Comparison of STFM and MISE models in
enforcing soft slowdown guarantees
– MISE significantly more effective in enforcing guarantees
• Minimizing maximum slowdown
– MISE improves fairness across several system configurations
Quantitative Comparison

[Figure: Actual slowdown and the STFM and MISE estimates for the SPEC CPU2006 application hmmer over 100 million cycles]
MISE-QoS: Mechanism to Provide Soft QoS

• Assign an initial bandwidth allocation to the QoS-critical application
• Estimate the slowdown of the QoS-critical application using the MISE model
• After every N intervals (control loop sketched below)
  – If slowdown > bound B + ε, increase the bandwidth allocation
  – If slowdown < bound B - ε, decrease the bandwidth allocation
• When the slowdown bound is not met for N intervals
  – Notify the OS so it can migrate/de-schedule jobs
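A sketch of this control loop under the stated rules; the ε value, the step size, and the function name are our assumptions, since the talk does not specify them.

```python
# Sketch of the MISE-QoS control loop; epsilon and step size are assumptions.
def adjust_allocation(slowdown, bound_b, allocation, epsilon=0.1, step=0.05):
    if slowdown > bound_b + epsilon:
        allocation = min(1.0, allocation + step)   # give more bandwidth
    elif slowdown < bound_b - epsilon:
        allocation = max(0.0, allocation - step)   # reclaim spare bandwidth
    return allocation
```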
A Look at One Workload

[Figure: Slowdowns of leslie3d (QoS-critical) and hmmer, lbm, and omnetpp (non-QoS-critical) under AlwaysPrioritize and MISE-QoS-10/1 through MISE-QoS-10/9, for slowdown bounds of 10, 3.33, and 2]

MISE is effective in
1. meeting the slowdown bound for the QoS-critical application
2. improving performance of non-QoS-critical applications
Performance of Non-QoS-Critical Applications

[Figure: Harmonic speedup of the non-QoS-critical applications vs. the number of memory-intensive applications in the workload, under AlwaysPrioritize and MISE-QoS-10/1 through MISE-QoS-10/9; performance is higher when the bound is loose]

When the slowdown bound is 10/3, MISE-QoS improves system performance by 10%
Case Study with Two QoS-Critical Applications

• Two comparison points
  – Always prioritize both applications (AlwaysPrioritize)
  – Prioritize each application 50% of the time (EqualBandwidth)

[Figure: Slowdowns of astar, mcf, leslie3d, and mcf under AlwaysPrioritize, EqualBandwidth, and MISE-QoS-10/1 through MISE-QoS-10/5]

MISE-QoS provides much lower slowdowns for non-QoS-critical applications and can achieve a lower slowdown bound for both QoS-critical applications
Minimizing Maximum Slowdown

• Goal
  – Minimize the maximum slowdown experienced by any application
• Basic Idea
  – Assign more memory bandwidth to the more slowed-down application
Mechanism

• The memory controller tracks
  – Slowdown bound B
  – Bandwidth allocation of all applications
• Different components of the mechanism
  – Bandwidth redistribution policy
  – Modifying the target bound
  – Communicating the target bound to the OS periodically
Bandwidth Redistribution

• At the end of each interval (sketched in code below),
  – Group applications into two clusters
  – Cluster 1: applications that meet the bound
  – Cluster 2: applications that do not meet the bound
  – Steal a small amount of bandwidth from each application in cluster 1 and allocate it to the applications in cluster 2
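A sketch of the redistribution step; the 5% steal fraction is our assumption, since the talk only says "a small amount".

```python
# Sketch of bandwidth redistribution; the steal fraction is an assumption.
def redistribute(allocations, slowdowns, bound_b, steal_frac=0.05):
    meets = [a for a in allocations if slowdowns[a] <= bound_b]   # cluster 1
    misses = [a for a in allocations if slowdowns[a] > bound_b]   # cluster 2
    if not meets or not misses:
        return allocations
    stolen = 0.0
    for a in meets:
        delta = allocations[a] * steal_frac
        allocations[a] -= delta
        stolen += delta
    for a in misses:
        # Share the stolen bandwidth evenly among cluster-2 applications.
        allocations[a] += stolen / len(misses)
    return allocations
```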
Modifying Target Bound

• If bound B is met for the past N intervals
  – The bound can be made more aggressive
  – Set the bound just higher than the slowdown of the most slowed-down application
• If bound B is not met for the past N intervals by more than half the applications
  – The bound should be more relaxed
  – Set the bound to the slowdown of the most slowed-down application
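A sketch of this bound-update rule; the 10% margin used for "just higher than" is our assumption.

```python
# Sketch of the target-bound update; the 10% margin is an assumption.
def update_bound(bound_b, slowdowns, met_history, miss_count, n_apps):
    max_slowdown = max(slowdowns.values())
    if all(met_history):            # bound met for the past N intervals
        return max_slowdown * 1.1   # tighten: just above current maximum
    if miss_count > n_apps / 2:     # more than half the apps miss the bound
        return max_slowdown         # relax to the worst-case slowdown
    return bound_b
```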
Results: Harmonic Speedup

[Figure: Harmonic speedup of FRFCFS, ATLAS, TCM, STFM, and MISE-Fair at core counts of 4, 8, and 16]
Results: Maximum Slowdown

[Figure: Maximum slowdown of FRFCFS, ATLAS, TCM, STFM, and MISE-Fair at core counts of 4, 8, and 16]
Sensitivity to Memory Intensity (16 cores)

[Figure: Maximum slowdown of FRFCFS, ATLAS, TCM, STFM, and MISE-Fair as the percentage of memory-intensive applications in the workload varies from 0 to 100, plus the average]
MISE: Per-Application Error (%)

Benchmark         STFM    MISE      Benchmark         STFM    MISE
453.povray        56.3    0.1       473.astar         12.3    8.1
454.calculix      43.5    1.3       456.hmmer         17.9    8.1
400.perlbench     26.8    1.6       464.h264ref       13.7    8.3
447.dealII        37.5    2.4       401.bzip2         28.3    8.5
436.cactusADM     18.4    2.6       458.sjeng         21.3    8.8
450.soplex        29.8    3.5       433.milc          26.4    9.5
444.namd          43.6    3.7       481.wrf           33.6    11.1
437.leslie3d      26.4    4.3       429.mcf           83.74   11.5
403.gcc           25.4    4.5       445.gobmk         23.1    12.5
462.libquantum    48.9    5.3       483.xalancbmk     18      13.6
459.GemsFDTD      21.6    5.5       435.gromacs       31.4    15.6
470.lbm           6.9     6.3       482.sphinx3       21      16.8
473.astar         12.3    8.1       471.omnetpp       26.2    17.5
456.hmmer         17.9    8.1       465.tonto         32.7    19.5
Sensitivity to Epoch and Interval Lengths

Epoch Length \ Interval Length   1 mil.   5 mil.   10 mil.   25 mil.   50 mil.
1000                             65.1%    9.1%     11.5%     10.7%     8.2%
10000                            64.1%    8.1%     9.6%      8.6%      8.5%
100000                           64.3%    11.2%    9.1%      8.9%      9%
1000000                          64.5%    31.3%    14.8%     14.9%     11.7%
Workload Mixes

Mix No.   Benchmark 1   Benchmark 2   Benchmark 3
1         sphinx3       leslie3d      milc
2         sjeng         gcc           perlbench
3         tonto         povray        wrf
4         perlbench     gcc           povray
5         gcc           povray        leslie3d
6         perlbench     namd          lbm
7         h264ref       bzip2         libquantum
8         hmmer         lbm           omnetpp
9         sjeng         libquantum    cactusADM
10        namd          libquantum    mcf
11        xalancbmk     mcf           astar
12        mcf           libquantum    leslie3d
STFM’s Effectiveness in Enforcing QoS

Across 3000 data points:

                      Predicted Met   Predicted Not Met
QoS Bound Met         63.7%           16%
QoS Bound Not Met     2.4%            17.9%
STFM vs. MISE’s System Performance

[Figure: System performance of MISE vs. STFM under QoS-10/1, QoS-10/3, QoS-10/5, QoS-10/7, and QoS-10/9]
MISE’s Implementation Cost

1. Per-core counters worth 20 bytes
   • Request Service Rate Shared
   • Request Service Rate Alone
     – 1 counter for the number of high priority epoch requests
     – 1 counter for the number of high priority epoch cycles
     – 1 counter for interference cycles
   • Memory phase fraction (α)
2. Register for current bandwidth allocation – 4 bytes
3. Logic for prioritizing an application in each epoch
MISE Accuracy w/o Interference Cycles
• Average error – 23%
MISE Average Error by Workload Category

Workload Category (Number of memory-intensive applications)   Average Error
0                                                             4.3%
1                                                             8.9%
2                                                             21.2%
3                                                             18.4%
Initial Ideas

• Separate slowdown into cache and memory slowdowns
• Determine resource allocations based on cache, memory, and overall slowdowns
QoS in Heterogeneous Systems

• Staged memory scheduling
  – In collaboration with Rachata Ausavarungnirun, Kevin Chang, and Gabriel Loh
  – Goal: High performance in CPU-GPU systems
• Memory scheduling in heterogeneous systems
  – In collaboration with Hiroyuki Usui
  – Goal: Meet deadlines for accelerators while improving performance
Publications