Lavanya Subramanian Providing High and Predictable Performance in Multicore Systems

advertisement
Providing High and Predictable Performance
in Multicore Systems
Through Shared Resource Management
Lavanya Subramanian
1
Shared Resource Interference
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Shared
Cache
Main
Memory
2
6
6
5
5
Slowdown
Slowdown
High and Unpredictable
Application Slowdowns
4
3
2
1
0
4
3
2
1
0
leslie3d (core 0)
gcc (core 1)
leslie3d (core 0)
mcf (core 1)
application
slowdowns depends
due to
2.1.
AnHigh
application’s
performance
shared
resource interference
on which
application
it is running with
3
Outline
Goals:
1. High performance
2. Predictable performance
• Blacklisting memory scheduler
• Predictability with memory
interference
4
Outline
Goals:
1. High performance
2. Predictable performance
• Blacklisting memory scheduler
• Predictability with memory
interference
5
Background: Main Memory
Row-buffer miss
hit
Memory
Controller
Rows
Columns
Bank 0
Bank 1
Bank 2
Bank 3
Row
Row
Buffer
Buffer
Row
Buffer
Row
Buffer
Row
Buffer
Channel
• FR-FCFS Memory Scheduler
[Zuravleff and Robinson, US Patent ‘97; Rixner et al., ISCA ‘00]
– Row-buffer hit first
– Older request first
• Unaware of inter-application interference
6
Tackling Inter-Application Interference:
Memory Request Scheduling
• Monitor application memory access
characteristics
• Rank applications based on memory access
characteristics
• Prioritize requests at the memory
controller, based on ranking
7
An Example:
Thread Cluster Memory Scheduling
Memory-non-intensive
thread
Throughput
thread
thread
thread
thread
thread
Non-intensive
cluster
higher
priority
higher
priority
Prioritized
thread
Threads in the system
Memory-intensive
Intensive cluster
Fairness
Figure: Kim et al., MICRO 2010
8
Problems with Previous
Application-aware Memory Schedulers
• Hardware Complexity
– Ranking incurs high hardware cost
• Unfair slowdowns of some applications
– Ranking causes unfairness
9
High Hardware Complexity
• Ranking incurs high hardware cost
– Rank computation incurs logic/storage cost
– Rank enforcement requires comparison logic
80000
8
6
4
2
FRFCFS
8x
TCM
Area (in square um)
Latency (in ns)
10
60000
1.8x
40000
FRFCFS
TCM
20000
Avoid ranking to achieve low
hardware cost
0
0
Synthesized with a 32nm standard cell library
10
astar
soplex
•
Lower-rank
applications
experience
40
80
30 significant slowdowns
60
50
Number of Requests
Number of Requests
Ranking Causes
Unfair Application Slowdowns
100
40
TCM
– Low memory service
causes
slowdown
20
Grouping
– Periodic rank shuffling not0sufficient
20
10
0
0
50
100
0
12
10
10
8
8
6
TCM
Grouping
Grouping
100
Execution Time (in 1000s of Cycles)
Slowdown
Slowdown
Execution Time (in 1000s of Cycles)
50
TCM
6
TCM
4
Grouping
Grouping offers lower unfairness
than
ranking
2
4
2
0
0
11
Problems with Previous
Application-Aware Memory Schedulers
• Hardware Complexity
– Ranking incurs high hardware cost
• Unfair slowdowns of some applications
– Ranking causes unfairness
Our Goal: Design a memory scheduler with
Low Complexity, High Performance, and Fairness
12
Towards a New Scheduler Design
• Monitor applications that have a number of
consecutive requests served
1. Simple Grouping Mechanism
• Blacklist such applications
• Prioritize requests of non-blacklisted applications
2. Enforcing Priorities Based On Grouping
• Periodically clear blacklists
13
Methodology
• Configuration of our simulated system
–
–
–
–
24 cores
4 channels, 8 banks/channel
DDR3 1066 DRAM
512 KB private cache/core
• Workloads
– SPEC CPU2006, TPCC, Matlab
– 80 multi programmed workloads
14
Metrics
• System Performance:
N
Harmonic Speedup 
IPCialone
 IPCshared
i
i
Slowdown
IPCishared
Weighted Speedup  
alone
i IPCi
10
6
4
2
0
0
0.5
1
Speedup
• Fairness:
Maximum Slowdown  max
8
alone
i
shared
i
IPC
IPC
15
Previous Memory Schedulers
• FR-FCFS [Zuravleff and Robinson, US Patent 1997, Rixner et al., ISCA 2000]
– Prioritizes row-buffer hits and older requests
– Application-unaware
• PARBS [Mutlu and Moscibroda, ISCA 2008]
– Batches oldest requests from each application; prioritizes batch
– Employs ranking within a batch
• ATLAS [Kim et al., HPCA 2010]
– Prioritizes applications with low memory-intensity
• TCM [Kim et al., MICRO 2010]
– Always prioritizes low memory-intensity applications
– Shuffles request priorities of high memory-intensity applications
16
Performance Results
8
15
FRFCFS
FRFCFS-Cap
6
PARBS
4
ATLAS
TCM
2
0
Blacklisting
Maximum Slowdown
Weighted Speedup
10
FRFCFS
10
FRFCFS-Cap
PARBS
ATLAS
5
TCM
Blacklisting
0
Approaches fairness of PARBS and
FRFCFS-Cap
achieving better
5% higher
system performance
and 25%
performance
than TCM
lower maximum
slowdown
than TCM
17
Complexity Results
Latency (in ns)
10
8
6
120000
FRFCFS
FRFCFS-Cap
PARBS
ATLAS
4
TCM
2
Blacklisting
0
Area (in square um)
12
100000
80000
60000
FRFCFS
FRFCFS-Cap
PARBS
ATLAS
40000
TCM
20000
Blacklisting
0
Blacklisting achieves
43%lower
lowerlatency
area than
70%
thanTCM
TCM
18
Outline
Goals:
1. High performance
2. Predictable performance
• Blacklisting memory scheduler
• Predictability with memory
interference
19
Need for Predictable Performance
• There is a need for predictable performance
– When multiple applications share resources
– Especially if some applications require performance
guarantees
• Example
1: In step:
server systems
As a first
Predictable
performance
– Different users’ jobs consolidated onto the same server
in
thetopresence
of memory
– Need
provide bounded
slowdowns tointerference
critical jobs
• Example 2: In mobile systems
– Interactive applications run with non-interactive applications
– Need to guarantee performance for interactive applications
20
Outline
Goals:
1. High performance
2. Predictable performance
• Blacklisting memory scheduler
• Predictability with memory
interference
21
Predictability in the Presence of
Memory Interference
1. Estimate Slowdown
– Key Observations
– MISE Operation: Putting it All Together
– Evaluating the Model
2. Control Slowdown
– Providing Soft Slowdown Guarantees
– Minimizing Maximum Slowdown
22
Predictability in the Presence of
Memory Interference
1. Estimate Slowdown
– Key Observations
– MISE Operation: Putting it All Together
– Evaluating the Model
2. Control Slowdown
– Providing Soft Slowdown Guarantees
– Minimizing Maximum Slowdown
23
Slowdown: Definition
Performance Alone
Slowdown 
Performance Shared
24
Key Observation 1
Normalized Performance
For a memory bound application,
Performance  Memory request service rate
1
omnetpp
0.9
Difficult
mcf
0.8
Requestastar
Service
Performanc
e AloneRate Alone
Intel Core i7, 4 cores
Slowdown0.6
Mem.
Bandwidth: 8.5 GB/s
Performanc
e
SharedRate
Request
Service
Shared
0.5
0.7
0.4
Easy
0.3
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Normalized Request Service Rate
25
Key Observation 2
Request Service Rate Alone (RSRAlone) of an
application can be estimated by giving the
application highest priority in accessing
memory
Highest priority  Little interference
(almost as if the application were run alone)
26
Key Observation 2
1. Run alone
Time units
Request Buffer State
Service order
3
2
Main
Memory
1
Main
Memory
2. Run with another application
Time units
Request Buffer State
Main
Memory
Service order
3
2
1
Main
Memory
3. Run with another application: highest priority
Time units
Request Buffer State
Main
Memory
3
Service order
2
1
Main
Memory
27
Memory Interference-induced Slowdown Estimation
(MISE) model for memory bound applications
Request Service Rate Alone (RSRAlone)
Slowdown 
Request Service Rate Shared (RSRShared)
28
Key Observation 3
• Memory-bound application
Compute Phase
Memory Phase
No
interference
With
interference
Req
Req
Req
time
Req
Req
Req
time
Memory phase slowdown dominates overall slowdown
29
Key Observation 3
• Non-memory-bound application
Compute Phase
Memory Phase
Memory Interference-induced Slowdown Estimation

1
(MISE) model for non-memory bound applications
No
interference
RSRAlone
time
Slowdown  (1 -  )  
RSRShared
With
interference
1
RSRAlone

RSRShared
time
Only memory fraction () slows down with interference
30
Predictability in the Presence of
Memory Interference
1. Estimate Slowdown
– Key Observations
– MISE Operation: Putting it All Together
– Evaluating the Model
2. Control Slowdown
– Providing Soft Slowdown Guarantees
– Minimizing Maximum Slowdown
31
MISE Operation: Putting it All Together
Interval
Interval
time


Measure RSRShared, 
Estimate RSRAlone


Measure RSRShared, 
Estimate RSRAlone
Estimate
slowdown
Estimate
slowdown
32
Predictability in the Presence of
Memory Interference
1. Estimate Slowdown
– Key Observations
– MISE Operation: Putting it All Together
– Evaluating the Model
2. Control Slowdown
– Providing Soft Slowdown Guarantees
– Minimizing Maximum Slowdown
33
Previous Work on Slowdown
Estimation
• Previous work on slowdown estimation
– STFM (Stall Time Fair Memory) Scheduling [Mutlu et al., MICRO ‘07]
– FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS ‘10]
• Basic Idea:
Difficult
Stall Time Alone
Slowdown 
Stall Time Shared
Easy
Count number of cycles application receives interference
34
Two Major Advantages of MISE Over STFM
• Advantage 1:
– STFM estimates alone performance while an
application is receiving interference  Difficult
– MISE estimates alone performance while giving an
application the highest priority  Easier
• Advantage 2:
– STFM does not take into account compute phase for
non-memory-bound applications
– MISE accounts for compute phase  Better accuracy
35
Methodology
• Configuration of our simulated system
–
–
–
–
4 cores
1 channel, 8 banks/channel
DDR3 1066 DRAM
512 KB private cache/core
• Workloads
– SPEC CPU2006
– 300 multi programmed workloads
36
Quantitative Comparison
SPEC CPU 2006 application
leslie3d
Slowdown
4
3.5
3
2.5
Actual
STFM
MISE
2
1.5
1
0
20
40
60
80
100
Million Cycles
37
4
4
3
3
3
2
1
2
1
4
Average
of MISE:
0
50
0error 50
100 8.2%
100
cactusADM
GemsFDTD
soplex
Average error of STFM: 29.4%
4
4
(across
300 workloads)
3
3
Slowdown
3
2
1
0
2
1
0
0
1
50
Slowdown
0
2
0
0
0
Slowdown
Slowdown
4
Slowdown
Slowdown
Comparison to STFM
50
wrf
100
100
2
1
0
0
50
calculix
100
0
50
100
povray
38
Predictability in the Presence of
Memory Interference
1. Estimate Slowdown
– Key Observations
– MISE Operation: Putting it All Together
– Evaluating the Model
2. Control Slowdown
– Providing Soft Slowdown Guarantees
– Minimizing Maximum Slowdown
39
MISE-QoS: Providing
“Soft” Slowdown Guarantees
• Goal
1. Ensure QoS-critical applications meet a prescribed
slowdown bound
2. Maximize system performance for other applications
• Basic Idea
– Allocate just enough bandwidth to QoS-critical
application
– Assign remaining bandwidth to other applications
40
Outline
Goals:
1. High performance
2. Predictable performance
• Blacklisting memory scheduler
• Predictability with memory
interference
41
A Recap
• Problem: Shared resource interference causes
high and unpredictable application slowdowns
• Approach:
– Simple mechanisms to mitigate interference
– Slowdown estimation models
– Slowdown control mechanisms
• Future Work:
– Extending to shared caches
42
Shared Cache Interference
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Shared
Cache
Main
Memory
43
Impact of Cache Capacity Contention
Shared Main Memory
Shared Main Memory and Caches
2
Slowdown
Slowdown
2
1.5
1
0.5
0
1.5
1
0.5
0
bzip2 (core 0) soplex (core 1)
bzip2 (core 0) soplex (core 1)
Cache capacity interference causes high
application slowdowns
44
Backup Slides
45
Outline
Goals:
1. High performance
2. Predictable performance
• Blacklisting memory scheduler
• Predictability with memory
interference
• Coordinated cache/memory
management for performance
• Cache slowdown estimation
• Coordinated cache/memory
management for predictability
46
Outline
Goals:
1. High performance
2. Predictable performance
• Blacklisting memory scheduler
• Predictability with memory
interference
• Coordinated cache/memory
management for performance
• Cache slowdown estimation
• Coordinated cache/memory
management for predictability
47
Request Service vs. Memory Access
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Memory
Access Rate
Shared
Cache
Main
Memory
Request
Service Rate
Request service and access rates tightly coupled
48
Estimating Cache and Memory Slowdowns
Through Cache Access Rates
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Cache
Access Rate
Shared
Cache
Main
Memory
Cache Access Rate Alone
Slowdown 
Cache Access Rate Shared
49
1.5
1.4
1.3
1.2
1.1
1
Slowdown
Slowdown
Cache Access Rate vs. Slowdown
1
1.2
1.4
Cache Access Rate Ratio
bzip2
2.2
2
1.8
1.6
1.4
1.2
1
1
1.5
2
2.5
Cache Access Rate Ratio
xalancbmk
50
Challenge
How to estimate alone cache access rate?
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Cache
Access Rate
Shared
Cache
Auxiliary
Tags
Main
Memory
Priority
51
Outline
Goals:
1. High performance
2. Predictable performance
• Blacklisting memory scheduler
• Predictability with memory
interference
• Coordinated cache/memory
management for performance
• Cache slowdown estimation
• Coordinated cache/memory
management for predictability
52
Leveraging Slowdown Estimates
for Performance Optimization
• How do we leverage slowdown estimates to
achieve high performance by allocating
– Memory bandwidth?
– Cache capacity?
• Leverage other metrics along with slowdowns
– Memory intensity
– Cache miss rates
53
Coordinated Resource
Allocation Schemes
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Cache capacity-aware
bandwidth allocation
Shared
Cache
Main
Memory
Bandwidth-aware
cache capacity allocation
54
Outline
Goals:
1. High performance
2. Predictable performance
• Blacklisting memory scheduler
• Predictability with memory
interference
• Coordinated cache/memory
management for performance
• Cache slowdown estimation
• Coordinated cache/memory
management for predictability
55
Coordinated Resource Management
Schemes for Predictable Performance
Goal: Cache capacity and memory bandwidth
allocation for an application to meet a bound
Challenges:
• Large search space of potential cache capacity
and memory bandwidth allocations
• Multiple possible combinations of
cache/memory allocations for each application
56
Outline
Goals:
1. High performance
2. Predictable performance
• Blacklisting memory scheduler
• Predictability with memory
interference
• Coordinated cache/memory
management for performance
• Cache slowdown estimation
• Coordinated cache/memory
management for predictability
57
Timeline
Apr. May Jun. Jul.
2014
2015
Aug. Sep. Oct. Nov. Dec. Jan. Feb. Mar. Apr. May
Cache slowdown estimation (75% Goal)
Coordinated cache/memory management
for performance (100% Goal)
Coordinated cache/memory management
for predictability (125% Goal)
Writeup thesis and defend
58
Summary
• Problem: Shared resource interference causes
high and unpredictable application slowdowns
• Goals: High and predictable performance
• Our Approach:
– Simple mechanisms to mitigate interference
– Slowdown estimation models
– Coordinated cache/memory management
59
Download