Controlled Resource and Data Sharing in Multi-Core Platforms*

Sandhya Dwarkadas
Department of Computer Science
University of Rochester

*Joint work with Arrvindh Shriraman, Hemayet Hossain, Xiao Zhang, Hongzhou Zhao,
Rongrong Zhong, Michael L. Scott, Michael Huang, Kai Shen

The University of Rochester
• Small private research university
• 4400 undergraduates
• 2800 graduate students
• Set on the Genesee River in Western NY State, near the south shore of Lake Ontario
• 250 km by road from Toronto; 590 km from New York City
The Computer Science Dept.
• 15 tenure-track faculty; 45 Ph.D. students
• Specializing in AI, theory, and parallel and distributed systems
• Among the best small departments in the US
The Hardware-Software Interface
[Figure: research projects spanning the hardware-software interface. Areas:
Concurrency (coherence, synchronization, consistency), Distributed Systems,
Memory Systems, Multi-Cores, Operating Systems (resource-aware OS scheduling),
Power-Aware Computing, Protection Support, and peer-to-peer systems. Projects
include TreadMarks, Cashmere-2L, RTM, FlexTM, InterWeave (server and library),
ARMCO, DDCache, RPPT, Willow, Sentry, LADLE, FCS, CAP, DT-CMT, and MCD, targeting
clusters, desktops, and handheld devices programmed in Fortran/C, C/C++, and Java.]

The Implications of Technology Scaling
• Many more transistors for compute power
• Energy constraints
• Large volumes of data
• High-speed communication
• Concurrency (parallel or distributed)
• Need support for
  – Scalable sharing
  – Reliability
  – Protection and security
  – Performance isolation
Source: http://weblog.infoworld.com/tech-bottom-line/archives/IT-In-The-Clouds_hp.jpg
Multi-Core Challenges
• Ensuring performance isolation
• Providing protected and controlled sharing across cores
• Scaling support for data sharing

Current Projects
• CoSyn: Communication and Synchronization Mechanisms for Emerging Multi-Core Processors
  – Collaboration with Professors Michael Scott and Michael Huang
  – Arrvindh Shriraman, Hemayet Hossain, Hongzhou Zhao
• Operating System-Level Resource Management in the Multi-Core Era
  – Collaboration with Professor Kai Shen
  – Xiao Zhang and Rongrong Zhong

See http://www.cs.rochester.edu/research/cosyn
and http://www.cs.rochester.edu/~sandhya
Multi-Core Challenges
• Ensuring performance isolation
• Providing protected and controlled sharing across cores
• Scaling support for data sharing

Performance Isolation

Resource Sharing is (and will be) Ubiquitous!
Resource Sharing on a Multicore Chip
• Memory bandwidth and the last-level cache are commonly shared by sibling cores
  sitting on the same chip
  – Intel's 6-core (12-thread), Sun UltraSPARC T1, AMD's 12-core, …
• Floating point, integer, state, and cache shared by multiple threads on a core
• Second-level cache shared by multiple cores on a chip
• Interconnect bandwidth shared on multiprocessors
Poor Performance Due to Uncontrolled Resource Contention
[Figure: experiments were conducted on a 3GHz Intel Core 2 Duo processor with a
shared 4MB L2 cache; annotation: "Win-win situation".]

Resource Management To Date
• Capitalistic - generation of more requests results in more resource usage
  – Performance: resource contention can result in significantly reduced overall performance
  – Fairness: equal time slice does not necessarily guarantee equal progress

Fluctuating Performance Due to Uncontrolled Resource Contention
[Figure: performance of art when co-running with different applications on an
Intel dual-core processor with a 4MB shared L2 cache.]

Fairness and Security Concerns
• Priority inversion
• Poor fairness among competing applications
• Information leakage at chip level
• Denial of service attack at chip level
Big Picture
Control resource usage of co-running applications
• Page coloring or hardware throttling [Eurosys'09] [USENIX'09]
Select which applications run together
• Resource-aware scheduling [USENIX'10]

Existing Mechanism (I): Software-Based Page Coloring
• Classic technique to reduce cache misses, now used by the OS to manage cache partitioning
• Partition the cache at coarse granularity
• No need for hardware support (see the color-computation sketch below)
[Figure: Thread A's and Thread B's memory pages (A1–A5, …) mapped by color to
ways/regions of the shared cache.]

Drawbacks of Page Coloring
• Expensive re-coloring cost
  – Prohibitive in a dynamic environment where frequent re-coloring may be necessary
• Complex memory management
  – Introduces artificial memory pressure
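To make the mechanism concrete, here is a minimal sketch of how a page's color can
be derived from its physical page frame number; the 4MB / 16-way / 64B geometry
matches the Core 2 Duo L2 used in these experiments, but the helper itself is
illustrative rather than the kernel code used in the study.

```c
/* Hedged sketch of how page coloring partitions a physically indexed cache:
 * the cache-set index bits that overlap the physical page frame number define
 * a page's "color".  Sizes below (4 MB, 16-way, 64 B lines, 4 KB pages) match
 * the Core 2 Duo L2 used in the experiments; the helper is illustrative. */
#include <stdint.h>

#define CACHE_SIZE    (4u << 20)  /* 4 MB shared L2 */
#define ASSOC         16
#define LINE_SIZE     64
#define PAGE_SIZE     4096

#define NUM_SETS      (CACHE_SIZE / (ASSOC * LINE_SIZE))   /* 4096 sets */
#define SETS_PER_PAGE (PAGE_SIZE / LINE_SIZE)              /* 64 sets per page */
#define NUM_COLORS    (NUM_SETS / SETS_PER_PAGE)           /* 64 colors */

/* Color of a physical page frame: the low PFN bits that also index the cache.
 * The OS gives a thread only pages of its assigned colors, so its data can
 * occupy only that slice of the shared cache. */
static inline unsigned page_color(uint64_t pfn)
{
    return (unsigned)(pfn % NUM_COLORS);
}
```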
Toward Practical Page Coloring
• Hotness-based page coloring
  – Efficiently find a small group of hot pages
  – Restrict page coloring or re-coloring to hot pages
  – Pay less re-coloring overhead while achieving most of the cache partitioning
    benefit (separate competing applications' most frequently accessed pages)
• Key challenge
  – An efficient way to track page hotness
[Figure: only Thread A's and Thread B's hot pages are segregated by color in the
shared cache.]
Methods to Track Page Hotness
• Using page protection
  – Capture page accesses by triggering page faults
  – Microseconds of overhead per page fault
• Using access bits
  – A single bit stored in each Page Table Entry (PTE)
  – Generally available on x86, automatically set by hardware upon page access
  – Tens of cycles per page table entry check
  – Recycle spare bits in the PTE as a hotness counter
  – Counter is aged to reflect recency and frequency

Sampling of Access Bits
• Decouple sampling frequency and window
  – Hotness sampling accuracy is determined by the sampling time window T
  – Hotness sampling overhead is determined by the sampling frequency N
[Figure: timeline - clear all access bits at times 0, N, 2N, 3N, 4N, …; check all
access bits at T, N+T, 2N+T, 3N+T, 4N+T, …. In our experiments, T = 2 milliseconds
and N = 100 or 10 milliseconds.]
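A minimal sketch of the access-bit sampling loop, assuming a per-process list of
candidate pages and a placeholder pte_test_and_clear_accessed() helper standing in
for the kernel's page-table walk; the clear-then-check schedule follows the
timeline above, and the aging rule is one plausible way to blend recency with
frequency.

```c
/* Hedged sketch of hotness sampling with PTE access bits.  hot_page,
 * pte_test_and_clear_accessed(), and the counter field are illustrative
 * placeholders for the kernel's per-process page-table walk. */
#define AGE_SHIFT 1        /* halve the counter each round: recency + frequency */

struct hot_page {
    void    *pte;          /* page-table entry for this page                    */
    unsigned hotness;      /* aged counter, e.g. kept in spare PTE bits         */
};

/* Placeholder: the real helper reads and clears the hardware-set accessed bit. */
static int pte_test_and_clear_accessed(void *pte) { (void)pte; return 0; }

/* Called twice per sampling period N: at the start of the window (phase 0) to
 * clear the access bits, and T milliseconds later (phase 1) to read them back
 * and fold the result into the aged hotness counter. */
void sample_hotness(struct hot_page *pages, int npages, int phase)
{
    for (int i = 0; i < npages; i++) {
        if (phase == 0) {
            (void)pte_test_and_clear_accessed(pages[i].pte);
        } else {
            int touched = pte_test_and_clear_accessed(pages[i].pte);
            pages[i].hotness = (pages[i].hotness >> AGE_SHIFT) + (touched ? 128 : 0);
        }
    }
}
```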
Miss-Ratio-Curve Driven Cache Partition Policy
• System optimization metric = a function of each thread's miss ratio at its cache allocation
• Constraint: cache size (4MB) = ΣA,B cache allocation
[Figure: Thread A's and Thread B's miss-ratio curves over cache allocation, with
the optimal partition point marked.]

Hot Page Coloring
• Budget control of page re-coloring overhead
  – % of time slice, e.g. 5%
• Recolor from the hottest page until the budget is reached
  – Maintain a set of hotness bins during sampling
    • bin[i][j] = number of pages in color i with normalized hotness in range [j, j+1]
      (see the sketch below)
  – Given a budget K, the K-th hottest page's hotness value is estimated in
    constant time by searching the hotness bins
  – Make sure hot pages are uniformly distributed among colors
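The hotness bins can be sketched as below; the 64-color / 128-bin sizing and the
function names are illustrative, but the point carries over: the K-th hottest
page's hotness is found by a walk whose cost depends only on the fixed number of
colors and bins, not on the number of pages.

```c
/* Hedged sketch of the hotness bins: bin[i][j] counts pages of color i whose
 * normalized hotness falls in [j, j+1).  Given a re-coloring budget of K pages,
 * the hotness of the K-th hottest page is estimated by walking bins from the
 * hot end; only pages at least that hot are recolored. */
#define NUM_COLORS 64
#define NUM_BINS   128

static unsigned bin[NUM_COLORS][NUM_BINS];

/* Return the lowest hotness level h such that roughly budget_k pages across all
 * colors are at least that hot.  The walk touches at most NUM_COLORS * NUM_BINS
 * entries, i.e. constant work regardless of how many pages the process maps. */
unsigned hotness_threshold(unsigned budget_k)
{
    unsigned count = 0;
    for (int j = NUM_BINS - 1; j >= 0; j--) {        /* hottest bin first */
        for (int i = 0; i < NUM_COLORS; i++)
            count += bin[i][j];
        if (count >= budget_k)
            return (unsigned)j;                      /* K-th hottest falls in bin j */
    }
    return 0;                                        /* budget covers all pages */
}
```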
Re-coloring Procedure
[Figure: pages of each color (red, blue, green, gray) sorted in ascending order of
their hotness counter values; with a budget of 3 pages, only the hottest pages are
moved when a thread's cache share decreases.]

Performance Comparison
[Figure: 4 SPECcpu2k benchmarks (art, equake, mcf, and twolf) running on 2 sibling
cores of an Intel Core 2 Duo that share a 4MB L2 cache.]
Additional Benefit of Hotness-Based Page Coloring
• Page coloring introduces artificial memory pressure
  – An app's footprint may be larger than its entitled memory color pages while
    the system still has an abundance of memory pages
• Allow an app to "steal" another's colors, but preferentially copy cold pages
  into the other's memory colors

Big Picture
Control resource usage of co-running applications
• Page coloring or hardware throttling
Select which applications run together
• Resource-aware scheduling
Hardware Execution Throttling
• Instead of directly controlling resource allocation, throttle the execution
  speed of the application that overuses a resource
• Available throttling knobs
  – Duty-cycle modulation
  – Frequency/voltage scaling
  – Cache prefetchers

Comparing Hardware Execution Throttling to Page Coloring
• Kernel code modification complexity
  – Code length: 40 lines in a single file; as a reference, our page coloring
    implementation takes 700+ lines of code crossing 10+ files
• Runtime overhead of configuration
  – Less than 1 microsecond; as a reference, re-coloring a page takes 3 microseconds
Existing Mechanism (II): Scheduling Quantum Adjustment
• Shorten the time slice of an app that overuses the cache
• May leave a core idle if there is no other active thread available

Drawback of Scheduling Quantum Adjustment
• Coarse-grained control at scheduling quantum granularity may result in
  fluctuating service delays for individual transactions
[Figure: timeline of Thread A on Core 0 and Thread B on Core 1, with idle gaps on
Core 1 when Thread B's quantum is shortened.]
New Mechanism: Hardware Execution Throttling [USENIX'09]
• Throttle the execution speed of the app that overuses the cache
  – Duty cycle modulation (see the MSR sketch below)
    • The CPU works only in duty cycles and stalls in non-duty cycles
    • Different from dynamic voltage/frequency scaling:
      per-core vs. per-processor control; thermal vs. power management
  – Enable/disable cache prefetchers
    • L1 prefetchers
      – IP: keeps track of the instruction pointer for load history
      – DCU: when detecting multiple loads from the same line within a time
        limit, prefetches the next line
    • L2 prefetchers
      – Adjacent line: prefetches the line adjacent to the required data
      – Stream: looks at streams of data for regular patterns

Comparison of Hardware Execution Throttling to the Other Two Mechanisms
• Comparison to page coloring
  – Little complexity added to the kernel
    • Code length: 40 lines in a single file; as a reference, our page coloring
      implementation takes 700+ lines of code crossing 10+ files
  – Lightweight to configure
    • Read plus write register: duty-cycle 265 + 350 cycles, prefetcher 298 + 2065 cycles
    • Less than 1 microsecond; as a reference, re-coloring a page takes 3 microseconds
• Comparison to scheduling quantum adjustment
  – More fine-grained control
[Figure: timelines contrasting quantum adjustment (Core 1 idles) with hardware
execution throttling (Thread B runs throttled, no idle core).]
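As a concrete illustration of the duty-cycle knob, here is a hedged user-space
sketch that programs per-core duty-cycle modulation through Linux's msr driver.
The IA32_CLOCK_MODULATION layout (MSR 0x19A, enable bit 4, duty-cycle level in
bits 3:1) follows the Intel SDM and should be re-verified for the target
processor; the 40-line kernel implementation referenced above presumably writes
the MSR directly rather than going through /dev/cpu.

```c
/* Hedged sketch: set per-core duty-cycle modulation via the Linux "msr"
 * driver (/dev/cpu/<n>/msr).  Level 8 means full speed (modulation off);
 * levels 1..7 run the core for that many eighths of each modulation window. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_CLOCK_MODULATION 0x19A   /* per Intel SDM; verify on target CPU */

static int set_duty_cycle(int cpu, unsigned level /* 1..8 */)
{
    char path[64];
    uint64_t val;

    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* Level 8 = no throttling: clear the on-demand modulation enable bit.
     * Otherwise set bit 4 (enable) and the level in bits 3:1. */
    val = (level >= 8) ? 0
                       : ((uint64_t)1 << 4) | ((uint64_t)(level & 0x7) << 1);

    /* For the msr driver, pwrite at offset = MSR number writes that MSR. */
    ssize_t n = pwrite(fd, &val, sizeof val, IA32_CLOCK_MODULATION);
    close(fd);
    return (n == (ssize_t)sizeof val) ? 0 : -1;
}
```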
Fairness Comparison
• Unfairness factor: coefficient of variation (deviation-to-mean ratio, σ/μ) of
  co-running apps' normalized performances (the normalization base is the
  execution time/throughput when the application monopolizes the whole chip);
  formulas below
• On average, all three mechanisms are effective in improving fairness
• The case {swim, SPECweb} illustrates a limitation of page coloring

Performance Comparison
• System efficiency: geometric mean of co-running apps' normalized performances
• On average, all three mechanisms achieve system efficiency comparable to default sharing
• Cases with severe inter-thread cache conflicts favor segregation, e.g. {swim, mcf}
• Cases with well-interleaved cache accesses favor sharing, e.g. {mcf, mcf}
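Written out, the two metrics from these slides are, with p_i denoting application
i's normalized performance among k co-runners:

```latex
\[
  \mu = \frac{1}{k}\sum_{i=1}^{k} p_i, \quad
  \sigma = \sqrt{\frac{1}{k}\sum_{i=1}^{k}\left(p_i-\mu\right)^2}, \quad
  \text{Unfairness} = \frac{\sigma}{\mu}, \qquad
  \text{System efficiency} = \Big(\prod_{i=1}^{k} p_i\Big)^{1/k}
\]
```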
Policies for Hardware Throttling-Based Multicore Management
• User-defined service level agreements (SLAs)
  – Proportional progress among competing threads
    • Unfairness metric: coefficient of variation of the threads' performance
  – Quality of service guarantee for high-priority application(s)
• Key challenge
  – The throttling configuration space grows exponentially as the number of cores increases
  – Quickly determining optimal or close-to-optimal throttling configurations is challenging

Model-Driven Iterative Framework
• Customizable performance estimation model
  – Reference configuration set and linear approximation
  – Currently incorporates duty cycle modulation and frequency/voltage scaling
• Iterative refinement
  – Prediction accuracy improves over time as more configurations are added to
    the reference set
Online Deployment: Hill-Climbing Search Acceleration
• For an m-throttling-level, n-core system, a brute-force search needs m^n model
  evaluations to predict a "best" configuration
• Hill climbing searches along the best child rather than all children
• Prunes the computation space to (m-1)n^2 (see the sketch below)

Iterative Refinement Patterns
[Figure: search tree over 4-core throttling configurations (X,Y,Z,U), expanding
only the best child at each level.]
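A minimal sketch of the pruned search referenced above. predict_perf() and
meets_sla() are placeholder stubs for the performance estimation model and the
SLA check, and the start-at-full-speed, throttle-one-level-at-a-time policy is my
reading of "searches along the best child", not the framework's actual code.

```c
/* Hedged sketch of hill-climbing over per-core duty-cycle levels.  At each
 * step only the n single-core reductions are evaluated and the best one is
 * followed, instead of enumerating all m^n configurations. */
#define NCORES    4
#define MAX_LEVEL 8   /* full speed */
#define MIN_LEVEL 4   /* half speed */

/* Placeholders for the linear performance model and the SLA check; the real
 * framework derives these from hardware counters and the reference set. */
static double predict_perf(const int levels[NCORES]) { (void)levels; return 0.0; }
static int    meets_sla(const int levels[NCORES])    { (void)levels; return 1; }

void hill_climb(int best[NCORES])
{
    int cur[NCORES];
    for (int i = 0; i < NCORES; i++)
        cur[i] = MAX_LEVEL;

    while (!meets_sla(cur)) {
        int best_core = -1;
        double best_score = -1.0;

        /* Evaluate only the n children that throttle one core by one level. */
        for (int c = 0; c < NCORES; c++) {
            if (cur[c] <= MIN_LEVEL)
                continue;
            cur[c]--;
            double score = predict_perf(cur);
            if (score > best_score) {
                best_score = score;
                best_core = c;
            }
            cur[c]++;
        }
        if (best_core < 0)
            break;              /* nothing left to throttle */
        cur[best_core]--;       /* follow the best child only */
    }
    for (int i = 0; i < NCORES; i++)
        best[i] = cur[i];
}
```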
Accuracy Evaluation
• Test platform
  – A quad-core Nehalem processor with an 8MB shared L3 cache
  – Search space from full CPU speed (duty cycle level 8) to half CPU speed
    (duty cycle level 4), so 369 configurations for each test
• Benchmarks: SPECCPU2k
  – Set-1: {mesa, art, mcf, equake}
  – Set-2: {swim, mgrid, mcf, equake}
  – Set-3: {swim, art, equake, twolf}
  – Set-4: {swim, applu, equake, twolf}
  – Set-5: {swim, mgrid, art, equake}

Capability of Satisfying SLAs
• Service level agreements (SLAs)
  – Fairness-oriented: keep the unfairness below a threshold
  – QoS-oriented: keep the QoS core above a QoS threshold
• 4 different unfairness/QoS thresholds for each of the 5 sets
• Optimization goal: satisfy SLAs while optimizing performance or power efficiency

            # Passing tests   Avg. num of samples   Avg. performance of picked
                                                    configs that pass tests
  Oracle         39/40                 0                     100%
  Model          39/40                4.1                    99.4%
  Random         25/40                15                     91.1%

Recall that the search space has 369 configurations.
Accuracy of Performance Estimation
• Error Rate = |Prediction – Actual| / Actual
[Figure: error-rate distribution of the performance estimation model.]

Big Picture
Control resource usage of co-running applications
• Page coloring or hardware throttling
Select which applications run together
• Resource-aware scheduling
Resource-Aware Scheduling
• Scheduling decisions can significantly affect performance

Similarity Grouping Scheduling
• Group applications with similar cache miss ratios on the same chip
  – Separate high and low miss-ratio apps onto different chips (see the sketch below)
• Benefits
  – Mitigates the cache thrashing effect
  – Avoids over-saturating memory bandwidth
  – Engages per-chip DVFS-based power savings
    • A single voltage setting applies to all sibling cores on existing multicore chips
    • The high-miss-ratio chip runs at low frequency while the low-miss-ratio
      chip runs at high frequency
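A minimal sketch of the grouping step, assuming per-application last-level-cache
miss ratios have already been measured; the struct and helper names are
illustrative, not the scheduler's real interfaces.

```c
/* Hedged sketch of similarity grouping: sort runnable apps by last-level-cache
 * miss ratio and fill chips with neighbours in that order, so high and low
 * miss-ratio apps land on different chips. */
#include <stdlib.h>

struct app { int pid; double miss_ratio; };

static int by_miss_ratio(const void *a, const void *b)
{
    double d = ((const struct app *)a)->miss_ratio -
               ((const struct app *)b)->miss_ratio;
    return (d > 0) - (d < 0);
}

/* After sorting, chip_of[i] holds the chip assignment of the i-th app in
 * ascending miss-ratio order: consecutive (similar) apps share a chip. */
void similarity_group(struct app *apps, int napps, int cores_per_chip, int *chip_of)
{
    qsort(apps, (size_t)napps, sizeof apps[0], by_miss_ratio);
    for (int i = 0; i < napps; i++)
        chip_of[i] = i / cores_per_chip;
}
```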
Frequency-to-Performance Model
• Objective: explore power savings with bounded performance loss
• Assumptions
  – An application's performance is linearly determined by cache and memory
    access latencies
  – Frequency scaling only affects on-chip accesses
  – Miss ratio does not vary across frequencies
• Normalized performance at frequency f = T(F) / T(f), where F is the maximum
  frequency (expanded below)

Model Accuracy
• Error Rate = (Prediction – Actual) / Actual
[Figure: model error rates across benchmarks and frequency settings.]
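Under these assumptions the prediction can be written as below; splitting T(F)
into an on-chip component that stretches with frequency and a memory component
that does not is my reading of the stated assumptions, not a formula quoted from
the slide.

```latex
\[
  T(f) \;\approx\; T_{\mathrm{onchip}}(F)\cdot\frac{F}{f} \;+\; T_{\mathrm{mem}},
  \qquad
  \text{Normalized performance at } f \;=\; \frac{T(F)}{T(f)}
\]
```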
Model-Based Dynamic Frequency Setting
• Dynamically adjust the CPU frequency based on the currently running
  application's behavior
  – Collect the cache miss ratio every 10 milliseconds
  – Calculate an appropriate frequency setting based on the performance
    estimation model
    • Guided by a performance degradation threshold (e.g. 10%); a sketch of the
      loop follows
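A minimal sketch of that loop, assuming hypothetical helpers
read_onchip_fraction() (derived from the sampled miss ratio) and set_cpu_freq(),
and an illustrative frequency table; the 10 ms period and the degradation
threshold come from the slide.

```c
/* Hedged sketch of the dynamic setting loop: every 10 ms, pick the lowest
 * frequency whose predicted slowdown stays under the threshold.  The helpers
 * and frequency table below are placeholders, not the actual implementation. */
#include <unistd.h>

#define NFREQ 4
static const double freq_ghz[NFREQ] = {3.0, 2.67, 2.33, 2.0};  /* F = freq_ghz[0] */

/* Placeholder: fraction of execution time spent in on-chip accesses, derived
 * from the sampled cache miss ratio. */
static double read_onchip_fraction(void) { return 0.5; }
static void   set_cpu_freq(double ghz)   { (void)ghz; }

void frequency_governor(double max_degradation /* e.g. 0.10 */)
{
    for (;;) {
        double onchip = read_onchip_fraction();
        double chosen = freq_ghz[0];

        for (int i = NFREQ - 1; i >= 0; i--) {       /* try slowest first */
            /* T(f)/T(F): the on-chip part stretches by F/f, memory is unchanged. */
            double slowdown = onchip * (freq_ghz[0] / freq_ghz[i]) + (1.0 - onchip);
            if (slowdown <= 1.0 + max_degradation) {
                chosen = freq_ghz[i];
                break;
            }
        }
        set_cpu_freq(chosen);
        usleep(10 * 1000);                           /* 10 ms sampling period */
    }
}
```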
Hardware Counter-Based Power Containers: An OS Resource
• Cross-core activity influence
• Online calibration with actual measurement
• Application-transparent online request context tracking

Power Conditioning Using Power Containers
Power Conditioning Achieved Using Targeted Throttling

Ongoing Work
• Variation-directed information and management
  – Using behavior fluctuation to trigger monitoring
  – Supporting fine-grain resource accounting
  – Developing policies to reshape behavior for high dependability and low jitter
  – Request-level power attribution, modeling, and management
Arch/App: Shared Memory ++: DIMM
[TRANSACT'06, ISCA'07, ISCA'08, ICS'09]
• Data Isolation (DI)
  – Provide control over propagation of writes
  – Buffer writes and allow group undo or propagation
  – Applications: sand-boxing, transactional programming, speculation
• Memory Monitoring (MM)
  – Monitor memory at summary or individual cache line level
  – Applications: synchronization/event notification, reliability, security,
    watchpoints/debugging

See http://www.cs.rochester.edu/research/cosyn
and http://www.cs.rochester.edu/~sandhya
Arch/App/OS: Protection: Separation of Privileges
• Reality: today's programs often consist of multiple modules written by
  different programmers
• Reliability and composability require developing access and interface conventions

Sentry: Light-Weight Auxiliary Memory Access Control [ISCA'10]
• Access checks are performed on an L1 miss
  – Saves 90x energy
  – Simplifies implementation
• A metadata cache (M-cache) is accessed in parallel with the L2 to speed up the check
The Indirection Problem [PACT 2008]
[Figure: a load of line A by P0 is forwarded through the home node's directory to
the current holder (P4 or P11, whether the line is held in M or S state) before
the data returns; longer distance means longer latency.]
Fine-Grain Data Sharing
• Simultaneous access to the same data by more than one core while the data
  still resides in some L1 cache
• Key idea: fine-grain sharing can be leveraged to localize communication

Goal: Localize Shared Data Communication
[Figure: P0 loads A directly from P4's L1 (whether P4 holds A in M or S state)
rather than indirecting through the home node; data is available at P0 in 2 vs.
10 physical hops.]
Summary
• Harnessing 50B transistors requires a fresh look at conventional
  hardware-software boundaries and interfaces, with support for
  – Scalable coherence design
  – Controlled data sharing via architectural support for
    • Memory monitoring, isolation, and protection
  – Controlled resource sharing via operating system-level policies for
    • Performance isolation
• We have examined coherence protocol additions to allow
  – Fast event-based communication
  – Fine-grain access control
  – Programmable support for isolation
  – Low-latency access for fine-grain data sharing
  – Software to determine policy decisions in a flexible manner
→ A combined hardware/software approach to support for concurrency with improved
  performance and scalability