
ICPP-38 2009

Bank-aware Dynamic Cache Partitioning for Multicore Architectures

Dimitris Kaseridis (1), Jeff Stuecheli (1,2), and Lizy K. John (1)

(1) University of Texas – Austin

(2) IBM – Austin

Laboratory for Computer Architecture 9/23/2009


Outline

 Motivation/background

 Cache partitioning/profiling

 Proposed system

 Results

 Conclusion/future work


Motivation

 Shared resources in CMPs

– Last Level Cache

– Memory bandwidth

 Opportunity and Pitfalls

– Constructive

• Mixing low and high cache requirements in shared pool

– Destructive

• Thrashing workloads (SPEC CPU 2000 art + mcf)

– Cache partitioning required

– Primary opportunity requires heterogeneous workload mixes

• Typical in consolidation + virtualization


Monolithic vs NUCA vs Industry architectures

 Monolithic: One large shared uniform latency cache bank on a CMP

– Does not exploit physical locality for private data

– Slow for all

 CMP-NUCA: Typical proposal has a very large number of autonomous cache banks

– Very flexible (256 banks)

– Non-optimal configuration

• Inefficient bank size (bank overhead)

 Real implementations

– Fewer banks in industry

– NUCA with discrete cache levels

– The key difference is the wire assumptions made in the original NUCA analysis

[Figure: tiled core/cache floorplans of the IBM POWER7 and Intel Nehalem EX]


Baseline System


 8 cores

 16 MB total capacity

– 16 x 1 MB banks

– 8 way associative

 Local Banks

– Low latency to the nearby core

 Center Banks

– Shared capacity


Cache Partitioning/Profiling


Cache Sharing/Partitioning

 Last level cache of CMP

– Once-isolated resources are now shared

• Drove the need for isolation

– Design space

• Non-configurable

– Shared vs private caches

• Static partitioning/policy

– Long term policy choice

• Dynamic

– Real time profiling directed partitions

– Trial and error (experiment to find ideal configuration)

– Predictive profilers

• Non-invasive state-space exploration (our system)


Bank-aware cache partitions

 System components

– Non-invasive profiling using the MSA (Mattson Stack Algorithm)

– Cache allocation using marginal utility

– Bank-aware LLC partitions


MSA Based Cache Profiling

 Mattson stack algorithm

– Originally proposed to concurrently simulate many cache sizes

– Structure is a true LRU cache

– Stack distance from MRU of each reference is recorded

– Misses can be calculated for any fraction of the ways
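The stack-distance bookkeeping above can be sketched as a small software model. This is an illustrative sketch of a true-LRU MSA profiler, not the paper's hardware design; class and method names are ours:

```python
class StackProfiler:
    """Mattson Stack Algorithm: one pass over a reference stream yields
    the miss count of every LRU cache size up to max_ways at once."""

    def __init__(self, max_ways):
        self.max_ways = max_ways
        self.stack = []                  # MRU at the front, true LRU order
        self.hist = [0] * max_ways       # hist[d]: hits at stack distance d
        self.misses = 0                  # references beyond max_ways

    def access(self, tag):
        if tag in self.stack:
            d = self.stack.index(tag)    # stack distance from MRU
            self.hist[d] += 1
            self.stack.pop(d)
        else:
            self.misses += 1
            if len(self.stack) == self.max_ways:
                self.stack.pop()         # evict LRU entry
        self.stack.insert(0, tag)        # promote to MRU

    def misses_for_ways(self, ways):
        # A ways-associative LRU cache misses on exactly the references
        # whose stack distance is >= ways (inclusion property of LRU).
        return sum(self.hist[ways:]) + self.misses
```

For the stream A, B, A, B the profiler records two cold misses and two hits at distance 1, so a 1-way cache misses all four references while a 2-way cache misses only the two cold ones.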


Hardware MSA implementation

 Naïve algorithm is prohibitive

– Fully associative

– Complete cache directory of the maximum cache size for every core on the CMP (total size)

 Reductions

– Set sampling

– Partial tags

– Maximal capacity

 Configuration in paper

– 12-bit partial tags

– 1/32 set sampling

– 9/16 bank per core

– 0.4% overhead relative to the on-chip cache
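The first two reductions can be illustrated with a toy sketch. The constants mirror the configuration in the paper (1/32 set sampling, 12-bit tags); the modulo selection of sampled sets is an assumption for illustration, since real hardware may hash the index instead:

```python
SAMPLE_RATE = 32   # profile only 1 of every 32 cache sets
TAG_BITS = 12      # store 12-bit partial tags instead of full tags

def is_monitored(set_index):
    # Simple modulo selection of sampled sets (illustrative choice).
    return set_index % SAMPLE_RATE == 0

def partial_tag(full_tag):
    # Keep only the low TAG_BITS bits. Occasional false tag matches are
    # an accepted trade-off for the large per-entry storage reduction.
    return full_tag & ((1 << TAG_BITS) - 1)
```

Together the two techniques shrink the profiler by roughly 32x in sets and by the tag-width ratio per entry, which is how the structure fits in a fraction of a percent of the cache.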


Marginal Utility

Miss rate relative to capacity is non-linear and heavily workload dependent

 Dramatic miss rate reduction as data structures become cache contained

 In practice, iteratively assign cache to the cores that produce the most hits per unit of capacity
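A minimal sketch of that iterative assignment, assuming each core's MSA-profiled miss curve is available as a list indexed by allocated ways (with at least total_ways + 1 entries per core; the function name is ours):

```python
def allocate_by_marginal_utility(miss_curves, total_ways):
    """Greedily hand out ways one at a time to whichever core's miss
    curve drops the most from one extra way (most hits gained)."""
    n = len(miss_curves)
    alloc = [0] * n
    for _ in range(total_ways):
        # Marginal utility of one more way for each core.
        gains = [curve[alloc[c]] - curve[alloc[c] + 1]
                 for c, curve in enumerate(miss_curves)]
        winner = max(range(n), key=gains.__getitem__)
        alloc[winner] += 1
    return alloc
```

For example, a core whose data structure fits in one way gets that way first (huge marginal gain), after which the remaining capacity flows to a core with a flatter but still-improving miss curve.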


Bank-aware LLC partitions

 (a) Ideal MSA model

 (b) Banked true LRU

– Cascaded banks

– Power inefficient

 (c) Realistic banking

– Allocation policy

• Hash allocation

• Random allocation

– Bank granularity

• Uniform requirement


Bank-aware allocation heuristics

 General idea

– As capacity grows, coarser assignment is good enough

 Only share portions of local cache banks between neighbors

 Central banks are assigned to a specific core

– Any core that receives central banks is also assigned its full local capacity


Cache allocation flowchart

 Assign full cache banks first (steps 1-3)

– All cores that have multiple banks are complete

 Partition remaining local banks (steps 4-7)

– Fine tune assignment

– Sharing pairs
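A toy sketch of that two-phase idea: whole banks first, then pairwise sharing of leftover ways within one local bank. BANK_WAYS matches the baseline system's 1 MB, 8-way banks, but the function, its inputs, and the proportional fallback are illustrative assumptions, not the paper's exact flowchart:

```python
BANK_WAYS = 8  # 1 MB, 8-way bank as in the baseline system

def bank_aware_partition(way_budget, neighbor_pairs):
    # Steps 1-3: satisfy each core's budget with whole banks first.
    whole_banks = {c: w // BANK_WAYS for c, w in way_budget.items()}
    leftover = {c: w % BANK_WAYS for c, w in way_budget.items()}
    # Steps 4-7: fine-tune by letting each neighbor pair share one
    # local bank for their remaining ways.
    shared = {}
    for a, b in neighbor_pairs:
        total = leftover[a] + leftover[b]
        if total <= BANK_WAYS:
            shared[(a, b)] = (leftover[a], leftover[b])
        else:
            # Scale down proportionally so the pair fits in one bank.
            wa = leftover[a] * BANK_WAYS // total
            shared[(a, b)] = (wa, BANK_WAYS - wa)
    return whole_banks, shared
```

A core budgeted 20 ways, for instance, receives two whole banks and shares 4 ways of a local bank with its neighbor.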


Evaluation


Methodology

 Workloads

– 8 cores running mixes drawn from 26 SPEC CPU 2000 workloads

– What benchmark mix?

– The typical approach is to classify benchmarks with limited experiments

– We wanted to cover a larger state space

 Monte Carlo

– Compare bank-aware miss rate to the ideal assignment

– Shows the algorithm works across many cases

 Detailed simulation

– Cycle accurate

– Full system

• Simics+GEMS+CourseBanks+CachePartitions


Monte Carlo

How close is the bank-aware assignment to the ideal monolithic one?

 Graph shows miss rate reduction

– 1000 random SPEC CPU 2000 benchmark mixes

 97% correlation in miss rates
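The 97% figure is a correlation between two miss-rate series over the random mixes. A self-contained way to compute such a statistic is plain Pearson correlation (the function below is a generic sketch, not the paper's evaluation harness):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Fed the bank-aware and ideal miss rates for each sampled mix, a value near 1.0 means the banked heuristic tracks the ideal monolithic assignment closely.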


Workload sets for detailed simulation


Cycle accurate simulation

 Overall

– Miss ratio

• 70% reduction over shared

• 25% over equal

– Throughput

• 43% increase over shared

• 11% increase over equal

[Charts: normalized miss ratio and throughput for No-Partitions, Equal-Partitions, and Bank-aware across Set1–Set8 and the geometric mean (GM)]


Conclusion/future work

 Significant miss rate reduction/throughput improvement possible

– Partitions are very important

– Marginal utility can work with realistic banked CMP caches

 Heterogeneous benchmark mixes are needed

– Can’t evaluate all combinations

– Hand chosen combinations are hard to compare across proposals


Thank You,

Questions?

Laboratory for Computer Architecture

University of Texas Austin

&

IBM Austin
