Gaining Insights into Multi-Core Cache Partitioning:
Bridging the Gap between Simulation and Real Systems
Jiang Lin¹, Qingda Lu², Xiaoning Ding², Zhao Zhang¹, Xiaodong Zhang², and P. Sadayappan²
¹ Department of ECE, Iowa State University
² Department of CSE, The Ohio State University
Shared Caches Can be a Critical
Bottleneck in Multi-Core Processors
 L2/L3 caches are shared by multiple cores
 Intel Xeon 51xx (2 cores per L2)
 AMD Barcelona (4 cores per L3)
 Sun T2, … (8 cores per L2)
[Figure: several cores sharing a single L2/L3 cache]
 Effective cache partitioning is critical to address the bottleneck caused by conflicting accesses in shared caches
 Several hardware cache partitioning methods have been proposed, with different optimization objectives:
 Performance: [HPCA’02], [HPCA’04], [Micro’06]
 Fairness: [PACT’04], [ICS’07], [SIGMETRICS’07]
 QoS: [ICS’04], [ISCA’07]
Limitations of Simulation-Based Studies
 Excessive simulation time
 Whole programs cannot be evaluated: simulating a single SPEC CPU2006 benchmark to completion would take weeks or months
 As the number of cores continues to increase, simulation capability becomes even more limited
 Absence of long-term OS activities
 Interactions between the processor and the OS affect performance significantly
 Proneness to simulation inaccuracy
 Bugs in simulators
 Many dynamics and details of a real system are impossible to model
Our Approach to Address the Issues
Design and implement OS-based cache partitioning:
 Embed the cache partitioning mechanism in the OS
 By enhancing the page coloring technique
 To support both static and dynamic cache partitioning
Evaluate cache partitioning policies on commodity processors:
 Execution- and measurement-based
 Run applications to completion
 Measure performance with hardware counters
Four Questions to Answer
 Can we confirm the conclusions made by simulation-based studies?
 Can we provide new insights and findings that simulation is not able to?
 Can we make a case for our OS-based approach as an effective option for evaluating multi-core cache partitioning designs?
 What are the advantages and disadvantages of OS-based cache partitioning?
Outline
 Introduction
 Design and implementation of OS-based cache partitioning
mechanisms
 Evaluation environment and workload construction
 Cache partitioning policies and their results
 Conclusion
OS-Based Cache Partitioning
Mechanisms
 Static cache partitioning
 Predetermines the amount of cache blocks allocated to each program at the beginning of its execution
 Page coloring enhancement
 Divides the shared cache into multiple regions and partitions the cache regions through OS page address mapping
 Dynamic cache partitioning
 Adjusts cache quotas among processes dynamically
 Page re-coloring
 Dynamically changes a process’s cache usage through OS page address re-mapping
Page Coloring
• Physically indexed caches are divided into multiple regions (colors).
• All cache lines in a physical page are cached in one of those regions (colors).
[Figure: a virtual address (virtual page number | page offset) is translated to a physical address (physical page number | page offset). The cache views the same physical address as (cache tag | set index | block offset); the page color bits are the overlap between the physical page number and the set index.]
The OS can control the page color of a virtual page through address mapping (by selecting a physical page with a specific value in its page color bits).
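A minimal sketch in C of how a page's color is computed, assuming the L2 parameters used later in the talk (4MB, 16-way, 64B lines) and 4KB pages; all constants and names are illustrative, not the authors' kernel code.

```c
#include <stdint.h>

#define PAGE_SHIFT   12                        /* 4KB pages       */
#define CACHE_BYTES  (4u << 20)                /* 4MB shared L2   */
#define NUM_WAYS     16
#define WAY_BYTES    (CACHE_BYTES / NUM_WAYS)  /* 256KB per way   */

/* Colors = way size / page size: the set-index bits that lie above
 * the page offset. These parameters give 64 hardware colors; the
 * talk groups them into 16 software colors (an assumption here). */
#define NUM_COLORS   (WAY_BYTES >> PAGE_SHIFT)

static inline unsigned page_color(uint64_t phys_addr)
{
    /* The page color bits sit just above the page offset. */
    return (unsigned)((phys_addr >> PAGE_SHIFT) & (NUM_COLORS - 1));
}
```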
Enhancement for Static Cache
Partitioning
Physical pages are grouped into page bins according to their page color.
[Figure: two processes' virtual pages are mapped through OS address mapping into disjoint sets of page bins, so each process occupies only its own colors in the physically indexed cache.]
The shared cache is partitioned between the two processes through address mapping.
Cost: main memory space needs to be partitioned too (co-partitioning).
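To make the co-partitioning cost concrete, here is a self-contained user-level sketch of color-restricted allocation, assuming free physical pages are kept in one free list per color; all types and helpers are hypothetical, not the authors' kernel code.

```c
#include <stddef.h>

struct page { struct page *next; };

struct color_allocator {
    struct page *free_lists[64];  /* one free list per page color    */
    unsigned    *colors;          /* colors assigned to this process */
    unsigned     num_colors;
    unsigned     next;            /* round-robin cursor              */
};

struct page *alloc_colored_page(struct color_allocator *a)
{
    /* Try each allowed color once, round-robin, so the process's
     * pages spread evenly across its share of the cache. */
    for (unsigned i = 0; i < a->num_colors; i++) {
        unsigned c = a->colors[a->next];
        a->next = (a->next + 1) % a->num_colors;
        if (a->free_lists[c]) {
            struct page *pg = a->free_lists[c];
            a->free_lists[c] = pg->next;
            return pg;
        }
    }
    /* No free page of an allowed color: the co-partitioning cost.
     * The process cannot borrow memory from other colors' bins. */
    return NULL;
}
```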
Dynamic Cache Partitioning
 Why?
 Programs have dynamic behaviors
 Most proposed schemes are dynamic
 How?
 Page re-coloring
 How to handle the overhead?
 Measure the overhead with performance counters
 Remove the overhead from the results (to emulate hardware schemes)
Dynamic Cache Partitioning through
Page Re-Coloring
[Figure: a page links table with one linked list of pages per allocated color, 0 through N−1.]
 Pages of a process are organized into linked lists by their colors.
 Memory allocation guarantees that pages are evenly distributed into all the lists (colors) to avoid hot spots.
 Page re-coloring:
 Allocate a page in the new color
 Copy the memory contents
 Free the old page
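The three re-coloring steps translate directly into code; below is a sketch where 4KB buffers stand in for physical pages and the helpers (per-color allocator, page-table update) are hypothetical.

```c
#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical helpers: per-color page allocator and mapping update. */
void *alloc_page_of_color(unsigned color);
void  free_colored_page(void *page);
void  remap_page(void *old_page, void *new_page);

void *recolor_page(void *old_page, unsigned new_color)
{
    void *new_page = alloc_page_of_color(new_color); /* 1. allocate in new color */
    if (!new_page)
        return old_page;                             /* keep the old page        */
    memcpy(new_page, old_page, PAGE_SIZE);           /* 2. copy memory contents  */
    remap_page(old_page, new_page);                  /* update address mapping   */
    free_colored_page(old_page);                     /* 3. free the old page     */
    return new_page;
}
```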
Control the Page Migration Overhead
 Control the frequency of page migration
 Frequent enough to capture application phase changes
 Not so frequent that it introduces large page migration overhead
 Lazy migration: avoid unnecessary page migration
 Observation: not all pages are accessed between two consecutive migrations
 Optimization: do not migrate a page until it is actually accessed
Lazy Page Migration
[Figure: the process page links table after the optimization; pages marked for re-coloring stay in place until touched.]
 Avoid unnecessary page migration for these pages!
 On average, 2% page migration overhead (up to 7%)
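One way to realize lazy migration is to unmap a marked page and defer the copy to the page-fault handler, so only pages that are actually touched pay the migration cost; a sketch with hypothetical helpers, reusing recolor_page() from above.

```c
/* Hypothetical helpers for the virtual-memory side. */
void  unmap_page(void *page);
void  map_page(void *page);
void *recolor_page(void *old_page, unsigned new_color);

struct vpage {
    void    *data;          /* current physical page      */
    unsigned target_color;  /* color it should move to    */
    int      needs_recolor; /* set at repartitioning time */
};

/* At repartitioning time: O(1) per page, no copying yet. */
void mark_for_recolor(struct vpage *vp, unsigned new_color)
{
    vp->target_color  = new_color;
    vp->needs_recolor = 1;
    unmap_page(vp->data);   /* force a fault on the next access */
}

/* In the page-fault handler: migrate only pages actually accessed. */
void handle_fault(struct vpage *vp)
{
    if (vp->needs_recolor) {
        vp->data = recolor_page(vp->data, vp->target_color);
        vp->needs_recolor = 0;
    }
    map_page(vp->data);     /* restore the mapping */
}
```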
Outline
 Introduction
 Design and implementation of OS-based cache partitioning
mechanisms
 Evaluation environment and workload construction
 Cache partitioning policies and their results
 Conclusion
Experimental Environment
 Dell PowerEdge 1950
 Two-way SMP, Intel dual-core Xeon 5160
 Shared 4MB L2 cache, 16-way
 8GB Fully Buffered DIMM memory
 Red Hat Enterprise Linux 4.0
 2.6.20.3 kernel
 Performance counter tool from HP (pfmon)
 The L2 cache is divided into 16 colors
Benchmark Classification
29 benchmarks from SPEC CPU2006, classified into four groups (Red: 6, Yellow: 9, Green: 6, Black: 8); the thresholds are sketched in code below.
 Is it sensitive to L2 cache capacity?
 Red group: IPC(1MB L2)/IPC(4MB L2) < 80%
 Giving red benchmarks more cache yields a big performance gain
 Yellow group: 80% < IPC(1MB L2)/IPC(4MB L2) < 95%
 Giving yellow benchmarks more cache yields a moderate performance gain
 Otherwise: does it access the L2 cache extensively?
 Green group: ≥ 14 accesses per 1K cycles
 Give it a small cache
 Black group: < 14 accesses per 1K cycles
 Cache insensitive
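The classification logic is a direct transcription of the slide's thresholds; the IPC and access-rate inputs would come from hardware counter measurements at the two cache sizes.

```c
/* Benchmark classification by L2 sensitivity and L2 access intensity. */
enum group { RED, YELLOW, GREEN, BLACK };

enum group classify(double ipc_1mb_l2, double ipc_4mb_l2,
                    double l2_accesses_per_1k_cycles)
{
    double ratio = ipc_1mb_l2 / ipc_4mb_l2;  /* cache sensitivity */

    if (ratio < 0.80)
        return RED;                    /* big gain from more cache      */
    if (ratio < 0.95)
        return YELLOW;                 /* moderate gain from more cache */
    if (l2_accesses_per_1k_cycles >= 14.0)
        return GREEN;                  /* busy in L2, but insensitive   */
    return BLACK;                      /* cache insensitive             */
}
```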
Workload Construction
27 two-core workloads, representative benchmark combinations:
 RR (3 pairs), RY (6 pairs), YY (3 pairs)
 RG (6 pairs), YG (6 pairs), GG (3 pairs)
Outline
 Introduction
 OS-based cache partitioning mechanism
 Evaluation environment and workload construction
 Cache partitioning policies and their results
 Performance
 Fairness
 Conclusion
Performance – Metrics
 Divide metrics into evaluation metrics and policy metrics [PACT’06]
 Evaluation metrics:
 Optimization objectives; not always available at run-time
 Policy metrics:
 Used to drive dynamic partitioning policies; available at run-time
 Sum of IPCs, combined cache miss rate, combined cache misses
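As a sketch of how the run-time policy metrics fall out of per-core hardware counters (the counter fields are illustrative; the talk reads such counters with pfmon):

```c
/* Per-core counter snapshot for one epoch; field names illustrative. */
struct counters {
    unsigned long long instructions, cycles, l2_misses, l2_accesses;
};

double sum_of_ipc(const struct counters *c, int ncores)
{
    double s = 0.0;
    for (int i = 0; i < ncores; i++)
        s += (double)c[i].instructions / (double)c[i].cycles;
    return s;
}

double combined_miss_rate(const struct counters *c, int ncores)
{
    unsigned long long misses = 0, accesses = 0;
    for (int i = 0; i < ncores; i++) {
        misses   += c[i].l2_misses;
        accesses += c[i].l2_accesses;
    }
    return (double)misses / (double)accesses;
}
```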
Static Partitioning
 Total number of cache colors: 16
 Give at least two colors to each program
 Make sure each program gets 1GB of memory to avoid swapping (because of co-partitioning)
 Try all possible partitionings for all workloads
 (2:14), (3:13), (4:12), …, (8:8), …, (13:3), (14:2)
 Record the value of each evaluation metric
 Compare the performance of all partitionings with the performance of the shared cache
Performance – Optimal Static Partitioning
[Chart: performance gain of optimal static partitioning, normalized to the shared cache (1.00 to 1.25), for workload types RR, RY, RG, YY, YG, and GG under four evaluation metrics: throughput, average weighted speedup, normalized SMT speedup, and fair speedup.]
 Confirms that cache partitioning has a significant performance impact
 Different evaluation metrics show different performance gains
 RG-type workloads have the largest performance gains (up to 47%)
 Other types of workloads also show performance gains (2% to 10%)
A New Finding
 Workload RG1: 401.bzip2 (Red) + 410.bwaves (Green)
 Intuitively, giving more cache space to 401.bzip2 (Red)
 Increases the performance of 401.bzip2 (Red) substantially
 Decreases the performance of 410.bwaves (Green) slightly
 However, we observe that giving more cache space to 401.bzip2 increases the performance of both programs
Insight into Our Finding
[Charts: memory bandwidth utilization (GB/s, roughly 2.70 to 3.05) and average memory access latency (ns, roughly 140 to 156) across partitionings from (2:14) to (14:2).]
Insight into Our Finding
 We have the same observation in RG4, RG5, and YG5
 This is not observed in simulation:
 Simulators did not model the main memory sub-system in detail
 They assumed a fixed memory access latency
 This shows the advantage of our execution- and measurement-based study
Performance – Dynamic Partition Policy
A simple greedy policy, emulating the policy of [HPCA’02] (sketched in code below):
 Init: partition the cache as (8:8)
 Run the current partition (P0:P1) for one epoch
 Try one epoch for each of the two neighboring partitions: (P0−1 : P1+1) and (P0+1 : P1−1)
 Choose the next partitioning with the best policy-metric measurement
 Repeat until the workload finishes, then exit
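A sketch of the greedy loop, assuming a run_epoch() helper that installs a partition via page re-coloring, runs one epoch, and returns the policy metric (treated here as higher-is-better; for combined miss rate the comparison would flip). All names are hypothetical.

```c
/* Hypothetical helpers. */
double run_epoch(int p0, int p1);   /* set partition, run, return metric */
int    workload_finished(void);

void greedy_policy(int total_colors /* 16 in the talk */, int min_colors /* 2 */)
{
    int p0 = total_colors / 2;                 /* init: (8:8) */

    while (!workload_finished()) {
        int    best   = p0;
        double best_m = run_epoch(p0, total_colors - p0);

        /* Probe the two neighboring partitions for one epoch each. */
        for (int d = -1; d <= 1; d += 2) {
            int q0 = p0 + d;
            if (q0 < min_colors || total_colors - q0 < min_colors)
                continue;                      /* respect the 2-color floor */
            double m = run_epoch(q0, total_colors - q0);
            if (m > best_m) { best_m = m; best = q0; }
        }
        p0 = best;                             /* move toward the best neighbor */
    }
}
```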
Performance – Static & Dynamic
 Use combined miss rate as the policy metric
 For RG-type, and some RY-type workloads:
 Static partitioning outperforms dynamic partitioning
 For RR- and YY-type, and the other RY-type workloads:
 Dynamic partitioning outperforms static partitioning
Fairness
– Metrics and Policy [PACT’04]
 Metrics
 Evaluation metric FM0: difference in slowdowns; smaller is better
 Policy metrics: FM1 to FM5
 Policy
 Repartitioning and rollback
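For a two-program workload, FM0 from [PACT’04] reduces to the difference between the two programs' slowdowns; a minimal sketch, with the IPC inputs measured by hardware counters:

```c
#include <math.h>

/* FM0 for two co-scheduled programs: |slowdown0 - slowdown1|, where
 * slowdown_i = IPC(alone) / IPC(shared). Smaller is fairer. */
double fm0(double ipc_alone0, double ipc_shared0,
           double ipc_alone1, double ipc_shared1)
{
    double slowdown0 = ipc_alone0 / ipc_shared0;
    double slowdown1 = ipc_alone1 / ipc_shared1;
    return fabs(slowdown0 - slowdown1);
}
```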
Fairness - Result
 Dynamic partitioning can achieve better fairness if we use FM0 as both the evaluation metric and the policy metric
 None of the policy metrics (FM1 to FM5) is good enough to drive the partitioning policy to fairness comparable with static partitioning
 Strong correlation was reported in the simulation-based study [PACT’04]
 We find that none of the policy metrics has a consistently strong correlation with FM0
 Differences between the studies: SPEC CPU2006 (ref input) vs. SPEC CPU2000 (test input); trillions of instructions completed vs. less than one billion; 4MB L2 cache vs. 512KB L2 cache
Conclusion
 Confirmed some conclusions made by simulations
 Provided new insights and findings
 Giving cache space from one program to another can increase the performance of both
 Poor correlation between evaluation and policy metrics for fairness
 Made a case for our OS-based approach as an effective option for evaluating multi-core cache partitioning
 Advantages of OS-based cache partitioning
 Works on commodity processors, enabling execution- and measurement-based studies
 Disadvantages of OS-based cache partitioning
 Co-partitioning (may underutilize memory), page migration overhead
Ongoing Work
 Reduce migration overhead on commodity processors
 Cache partitioning at the compiler level
 Partition cache at object level
 Hybrid cache partitioning method
 Remove the cost of co-partitioning
 Avoid page migration overhead
Thanks!
Gaining Insights into Multi-Core Cache Partitioning:
Bridging the Gap between Simulation and Real Systems
Jiang Lin¹, Qingda Lu², Xiaoning Ding², Zhao Zhang¹, Xiaodong Zhang², and P. Sadayappan²
¹ Iowa State University
² The Ohio State University
Backup Slides
Fairness - Correlation between Evaluation Metrics and
Policy Metrics (Reported by [PACT’04])
[Chart: correlation coefficients Corr(M1,M0) through Corr(M5,M0), on a scale from −1 to 1, for apsi+equake, gzip+apsi, swim+gzip, tree+mcf, and the average over 18 workloads (AVG18).]
Strong correlation was reported in the simulation study [PACT’04].
Fairness - Correlation between Evaluation Metrics and
Policy Metrics (Our result)
[Chart: correlations of FM1 through FM5 with FM0, on a scale from −1 to 1, across workloads YY1–YY3, YG1–YG6, and GG1–GG3.]
 None of the policy metrics has a consistently strong correlation with FM0
 Differences from [PACT’04]: SPEC CPU2006 (ref input) vs. SPEC CPU2000 (test input); trillions of instructions completed vs. less than one billion; 4MB L2 cache vs. 512KB L2 cache