ICPP-38 2009
Bank-aware Dynamic Cache Partitioning for
Multicore Architectures
Dimitris Kaseridis [1], Jeff Stuecheli [1,2], and Lizy K. John [1]
[1] University of Texas – Austin
[2] IBM – Austin
Laboratory for Computer Architecture 9/23/2009
Outline
Motivation/background
Cache partitioning/profiling
Proposed system
Results
Conclusion/future work
Motivation
Shared resources in CMPs
– Last-level cache
– Memory bandwidth
Opportunities and pitfalls
– Constructive
• Mixing low and high cache requirements in a shared pool
– Destructive
• Thrashing workloads (SPEC CPU 2000 art + mcf)
– Cache partitioning required
– The primary opportunity requires heterogeneous workload mixes
• Typical in consolidation and virtualization
Monolithic vs. NUCA vs. industry architectures
Monolithic: one large, shared, uniform-latency cache bank on a CMP
– Does not exploit physical locality for private data
– Slow for all accesses
CMP-NUCA: the typical proposal has a very large number of autonomous cache banks
– Very flexible (256 banks)
– Non-optimal configuration
• Inefficient bank size (per-bank overhead)
Real implementations
– Industry uses far fewer banks
– NUCA with discrete cache levels
– The key difference is the wire assumptions made in the original NUCA analysis
[Figure: tiled core/cache floorplans of the IBM POWER7 and Intel Nehalem EX]
Baseline System
8 cores
16 MB total capacity
– 16 × 1 MB banks
– 8-way associative
Local banks
– Low latency to the adjacent core
Center banks
– Shared capacity
Cache Sharing/Partitioning
Last-level cache of a CMP
– Once-isolated resources are now shared
• This drove the need for isolation
– Design space
• Non-configurable
– Shared vs. private caches
• Static partitioning/policy
– Long-term policy choice
• Dynamic
– Real-time, profiling-directed partitions
– Trial and error (experiment to find the ideal configuration)
– Predictive profilers
> Non-invasive state-space exploration (our system)
Bank-aware cache partitions
System components
– Non-invasive profiling using the MSA (Mattson Stack Algorithm)
– Cache allocation using marginal utility
– Bank-aware LLC partitions
MSA-Based Cache Profiling
Mattson stack algorithm
– Originally proposed to simulate many cache sizes concurrently
– The structure is a true LRU cache
– The stack distance from the MRU position of each reference is recorded
– Misses can be calculated for any fraction of the ways
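A minimal sketch of the algorithm, assuming a single fully associative LRU stack for clarity (a real profiler keeps state per sampled set); class and method names are illustrative:

```python
from collections import defaultdict

class MSAProfiler:
    """Mattson stack algorithm: one pass over an address trace yields
    miss counts for every associativity simultaneously."""

    def __init__(self):
        self.stack = []               # true-LRU stack, MRU at index 0
        self.hist = defaultdict(int)  # stack-distance histogram
        self.refs = 0

    def access(self, tag):
        self.refs += 1
        if tag in self.stack:
            self.hist[self.stack.index(tag)] += 1  # record stack distance
            self.stack.remove(tag)
        # (a reference absent from the stack misses at any size and
        #  simply never increments the histogram)
        self.stack.insert(0, tag)     # promote to MRU

    def misses(self, ways):
        """References with stack distance >= `ways` miss in a
        `ways`-way LRU cache; the rest hit."""
        hits = sum(n for dist, n in self.hist.items() if dist < ways)
        return self.refs - hits
```

Sweeping `ways` from 1 up to the full associativity turns the histogram into the per-core miss curve that the allocator consumes.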
Hardware MSA implementation
The naïve algorithm is prohibitive
– Fully associative
– A complete cache directory of the maximum cache size for every core on the CMP (total size)
Reductions
– Set sampling
– Partial tags
– Maximal capacity
Configuration in the paper
– 12-bit partial tags
– 1/32 set sampling
– 9 of 16 banks covered per core
– ≈0.4% overhead relative to the on-chip cache
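As a hedged sanity check of that overhead figure, the arithmetic below assumes 64-byte cache lines and roughly 4 LRU bits per entry, and reads "9/16 banks per core" as each core's profiler covering up to 9 MB of reach; none of these details are stated on the slide:

```python
# Back-of-the-envelope check of the ~0.4% profiler overhead.
LINE_BYTES = 64                               # cache line size (assumption)
LLC_BYTES  = 16 * 2**20                       # 16 MB last-level cache
REACH      = 9 * 2**20                        # 9 of 16 one-MB banks per core
entries    = REACH // LINE_BYTES // 32        # 1/32 set sampling -> 4608 entries
entry_bits = 12 + 4                           # 12-bit partial tag + ~4 LRU bits (assumption)
total_bytes = 8 * entries * entry_bits // 8   # 8 cores, bits -> bytes (~72 KB)
print(f"{total_bytes / LLC_BYTES:.2%}")       # ~0.44%, consistent with the slide
```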
Marginal Utility
Miss rate relative to capacity is non-linear and heavily workload dependent
Dramatic miss rate reductions occur as data structures become cache-contained
In practice
– Iteratively assign cache to the core that produces the most hits per unit of capacity
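A minimal sketch of that greedy loop, assuming the per-core miss curves produced by the MSA profiler above; `miss_curves[core][w]` is the miss count at `w` allocated ways, and all names are illustrative:

```python
def marginal_utility_allocate(miss_curves, total_ways, max_ways):
    """Greedily hand out one way at a time to whichever core gains
    the most hits from that extra unit of capacity."""
    alloc = {core: 0 for core in miss_curves}

    def gain(core):
        w = alloc[core]
        if w >= max_ways:
            return -1                     # core is at its capacity cap
        return miss_curves[core][w] - miss_curves[core][w + 1]

    for _ in range(total_ways):
        best = max(alloc, key=gain)
        if gain(best) < 0:
            break                         # every core is capped
        alloc[best] += 1
    return alloc
```

With the baseline's 16 banks of 8 ways each, `total_ways` would be 128, and the 9-bank profiler reach would cap `max_ways` at 72.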
Bank-aware LLC partitions
(a) Ideal MSA model
(b) Banked true LRU
– Cascaded banks
– Power inefficient
(c) Realistic banking
– Allocation policy
• Hash allocation
• Random allocation
– Bank granularity
• Uniform requirement
Bank-aware allocation heuristics
General idea
– As capacity grows, coarser assignment is good enough
Only local cache banks are shared, in portions, between neighboring cores
Central banks are assigned to a specific core
– Any core that receives central banks is also assigned its full local capacity
Cache allocation flowchart
Assign full cache banks first (steps 1–3)
– All cores that receive multiple banks are complete
Partition the remaining local banks (steps 4–7)
– Fine-tune the assignment
– Sharing pairs of neighboring cores
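A sketch of how this quantization could look, consuming the ideal per-core way counts from the marginal-utility step; the pairing and rounding details below are illustrative, not the paper's exact steps:

```python
BANK_WAYS = 8                               # ways per 1 MB bank
PAIRS = [(0, 1), (2, 3), (4, 5), (6, 7)]    # neighboring-core pairs (assumption)

def bank_aware_quantize(ideal_ways, center_banks=8):
    """Steps 1-3: cores wanting more than one bank get whole banks
    (their full local bank plus center banks).  Steps 4-7: the remaining
    cores split their pair's local banks proportionally."""
    alloc = {}
    for core, ways in sorted(ideal_ways.items(), key=lambda kv: -kv[1]):
        banks = round(ways / BANK_WAYS)
        if banks > 1:                        # whole-bank core
            extra = min(banks - 1, center_banks)
            center_banks -= extra
            alloc[core] = (1 + extra) * BANK_WAYS
    for a, b in PAIRS:
        small = [c for c in (a, b) if c not in alloc]
        if len(small) == 2:                  # both small: split two local banks
            total = 2 * BANK_WAYS
            share = ideal_ways[a] / max(1, ideal_ways[a] + ideal_ways[b])
            wa = max(1, min(total - 1, round(share * total)))
            alloc[a], alloc[b] = wa, total - wa
        elif len(small) == 1:                # neighbor took whole banks:
            alloc[small[0]] = BANK_WAYS      # keep the whole local bank
    return alloc
```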
Methodology
Workloads
– 8 cores running mixes of the 26 SPEC CPU 2000 workloads
– Which benchmark mix? The typical approach is to classify with limited experiments
• We wanted to cover a larger state space
Monte Carlo
– Compare the bank-aware miss rate to the ideal assignment
• Shows the algorithm works for many cases
Detailed simulation
– Cycle accurate
– Full system
• Simics + GEMS, extended with coarse-grained banking and cache partitioning
Monte Carlo
How close is the bank-aware assignment to the ideal monolithic one?
The graphic shows the miss rate reduction for
– 1,000 random SPEC CPU 2000 benchmark mixes
97% correlation in miss rates
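A sketch of the experiment's shape, with synthetic miss curves standing in for the profiled SPEC CPU 2000 data and reusing the `marginal_utility_allocate` and `bank_aware_quantize` sketches from earlier; everything here is illustrative:

```python
import random

def synthetic_curve(max_ways=72):
    """Stand-in miss curve: monotonically non-increasing, random knee."""
    knee = random.randint(4, max_ways)
    return [1e6 * max(0.05, 1 - w / knee) for w in range(max_ways + 1)]

random.seed(0)
ideal_misses, banked_misses = [], []
for _ in range(1000):                        # 1000 random 8-core mixes
    curves = {core: synthetic_curve() for core in range(8)}
    ideal  = marginal_utility_allocate(curves, total_ways=128, max_ways=72)
    banked = bank_aware_quantize(ideal)
    ideal_misses.append(sum(curves[c][w] for c, w in ideal.items()))
    banked_misses.append(sum(curves[c][w] for c, w in banked.items()))
# The paper's version of this comparison, run on real MSA profiles,
# reports ~97% correlation between the banked and ideal miss rates.
```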
Workload sets for detailed simulation
Cycle-accurate simulation
Overall
– Miss ratio
• 70% reduction over shared
• 25% reduction over equal partitions
– Throughput
• 43% increase over shared
• 11% increase over equal partitions
[Charts: normalized miss ratio and throughput for Set1–Set8 and the geometric mean (GM), comparing No-Partitions, Equal-Partitions, and Bank-aware]
Conclusion/future work
Significant miss rate reduction and throughput improvement are possible
– Partitions are very important
– Marginal utility can work with realistic banked CMP caches
A heterogeneous benchmark methodology is needed
– We can't evaluate all combinations
– Hand-chosen combinations are hard to compare across proposals
Laboratory for Computer Architecture
University of Texas – Austin
&
IBM – Austin