A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems
Dimitris Kaseridis, Jeffery Stuecheli, Jian Chen and Lizy K. John
Department of Electrical and Computer Engineering, The University of Texas at Austin, TX, USA; IBM Corp., Austin, TX, USA
Reviewed by: Stanley Ikpe
Overview
 Terminology
 Paper Breakdown
 Paper Summary
  Objective
  Implementation
  Results
 General Comments
 Discussion Topics
Terminology
 Chip Multiprocessor (CMP): multiple processor cores on a single chip
 Throughput: measure of work done; successful messages delivered
 Bandwidth (memory): rate at which data can be read/stored
 Quality of Service (QoS): ability to provide priority to applications
 Fairness: ability to allocate resources equitably among applications
 Resources: utilities used for work (cache capacity and memory bandwidth)
 Last Level Cache (LLC): largest (and slowest) cache memory (on or off chip)
Paper Breakdown
 Motivation: CMP integration provides an opportunity for improved throughput. Conversely, sharing resources can be hazardous to performance.
 Causes: parallel applications; each thread (core) places different demands/requests on common (shared) resources.
 Effects: inconsistencies in performance and resource contention (unfairness).
Paper Breakdown
 So how do we fix this?
 Resource Management: control the allocation and use of available resources.
 What are some of these resources?
  Cache capacity
  Available memory bandwidth
Paper Breakdown
 How do we go about resource management?
 Predictive work monitoring: intuitively infer what resources will be used, via a non-invasive (hardware) method of profiling resources (cache capacity and memory bandwidth).
 System-wide resource allocation and job scheduling: identify over-utilized CMPs (bandwidth) and reallocate work.
Baseline Architecture
Set-Associative Design
[3] www.utdallas.edu/~edsha/parallel/2010S/Cache-Overview.pdf
Objectives
 Create an algorithm to effectively project memory bandwidth and cache capacity requirements (per core).
 Implement system-wide optimization of resource allocation and job scheduling.
 Improve potential throughput for CMP systems.
Implementation
 Resource Profiling: a prediction scheme to detect cache misses and bandwidth requirements.
 Mattson's stack distance algorithm (MSA): a method for reducing the simulation time of trace-driven cache studies (Mattson et al. [2]).
 MSA-based profiler for LLC misses: a K-way set-associative cache implies K+1 counters. A cache access at stack position i increments counter i; a cache miss increments counter K+1.
MSA-based Profiler for LLC Misses
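The stack-distance counter scheme can be sketched in software. The following is a minimal Python illustration of the MSA bookkeeping, not the paper's hardware profiler: a single LRU stack stands in for per-set shadow tags, and all names are invented for illustration.

```python
from collections import deque

class MSAProfiler:
    """Sketch of an MSA-based LLC miss profiler for a K-way set-associative
    cache: counter i counts hits at stack distance i, the (K+1)-th counter
    counts misses. One global LRU stack is used here for simplicity."""

    def __init__(self, k):
        self.k = k
        # counters[0..k-1]: hits at stack distance i; counters[k]: misses
        self.counters = [0] * (k + 1)
        self.stack = deque()  # LRU stack, most recently used at the left

    def access(self, line):
        if line in self.stack:
            dist = self.stack.index(line)   # 0-based stack distance
            self.counters[dist] += 1
            self.stack.remove(line)
        else:
            self.counters[self.k] += 1      # miss: increment counter K+1
        self.stack.appendleft(line)         # line becomes most recently used

    def misses_for_ways(self, ways):
        """Project misses for a smaller allocation of `ways` ways: every hit
        at stack distance >= ways would have been a miss."""
        return self.counters[self.k] + sum(self.counters[ways:self.k])
```

The key property (from Mattson et al.) is that one pass over the trace yields projected miss counts for every possible way allocation at once, via `misses_for_ways`.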
Implementation
 MSA-based Profiler for Memory Bandwidth: profiles the memory bandwidth required to read (due to cache fills) and write (due to dirty cache-line write-backs to main memory).
• Hits to dirty cache lines indicate write-back operations if the cache capacity allocation < stack distance.
• The Dirty Stack Distance is used to track the largest stack distance at which a dirty line is accessed.
• A dirty counter projects the write-back rate, and a dirty bit marks the greatest stack distance of the dirty line.
Write-back pseudocode
Write-back Profiling Example (SPEC CPU 2006)
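The dirty-stack-distance idea can also be sketched in software. This is a hedged Python sketch of the bookkeeping described on the slide, not the paper's exact hardware: a dirty line's recorded distance is the greatest stack distance seen while dirty, and under an allocation of w ways any dirty line with recorded distance >= w is projected to be written back. Class and method names are invented.

```python
class WriteBackProfiler:
    """Sketch of write-back profiling via dirty stack distance: each dirty
    line records the greatest stack distance at which it was accessed while
    dirty; per-distance dirty counters project the write-back rate for any
    hypothetical cache allocation."""

    def __init__(self, k):
        self.k = k
        self.dirty_counters = [0] * (k + 1)  # projected write-backs per distance
        self.stack = []                      # line tags, MRU at index 0
        self.dirty_dist = {}                 # line -> greatest dirty stack distance

    def access(self, line, is_write):
        if line in self.stack:
            d = self.stack.index(line)
            self.stack.remove(line)
            if line in self.dirty_dist and d > self.dirty_dist[line]:
                # Hit to a dirty line beyond its recorded distance: move the
                # projected write-back out to the new, larger distance.
                self.dirty_counters[self.dirty_dist[line]] -= 1
                self.dirty_counters[min(d, self.k)] += 1
                self.dirty_dist[line] = min(d, self.k)
        self.stack.insert(0, line)           # line becomes most recently used
        if is_write and line not in self.dirty_dist:
            self.dirty_dist[line] = 0        # newly dirtied at distance 0
            self.dirty_counters[0] += 1

    def writebacks_for_ways(self, w):
        """Dirty lines whose dirty stack distance >= w would be evicted (and
        written back) under an allocation of w ways."""
        return sum(self.dirty_counters[w:])
```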
Implementation
 Resource Allocation: compute the Marginal Utility for a given workload across a range of possible cache allocations, to compare all possible allocations of unused capacity (n new elements, c already-used elements).
 Intra-chip partitioning algorithm: Marginal Utility is a figure of merit measuring the amount of utility provided (reduced cache misses) for a given amount of resource (cache capacity). The algorithm considers the ideal cache capacity and distributes specific cache ways per core.
Algorithm
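The marginal-utility idea above can be illustrated with a short sketch: here a greedy loop hands out ways one at a time to whichever core gains the most, using per-core miss curves such as those produced by an MSA profiler. This is an illustrative simplification, not the paper's algorithm; all names are invented.

```python
def marginal_utility(misses, c, n):
    """Marginal utility of growing an allocation from c to c+n ways:
    miss reduction per additional way. `misses[w]` is the projected
    miss count with w ways (e.g. from a stack-distance profiler)."""
    return (misses[c] - misses[c + n]) / n

def partition_ways(miss_curves, total_ways, min_ways=1):
    """Greedy intra-chip partitioning sketch: repeatedly give the next
    way to the core with the highest marginal utility."""
    alloc = [min_ways] * len(miss_curves)
    for _ in range(total_ways - sum(alloc)):
        best = max(range(len(miss_curves)),
                   key=lambda i: marginal_utility(miss_curves[i], alloc[i], 1))
        alloc[best] += 1
    return alloc
```

With one cache-friendly core and one streaming core, the greedy loop naturally concentrates ways on the core whose miss curve drops fastest.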
Implementation
 Inter-chip partitioning algorithm: find an efficient (below threshold or bandwidth limit) workload schedule on all available CMPs in the system. A global implementation is used to mitigate misdistribution of the workload. The Marginal-Utility algorithm, alongside bandwidth over-commit detection, allows additional workload migration.
• Cache capacity: estimate the optimal resource assignment (marginal utility) and the intra-chip partitioning assignment. The algorithm performs workload swapping so each core stays below the bandwidth limit.
• Memory bandwidth: the memory bandwidth over-commit algorithm finds workloads with high/low requirements and shifts them to under-committed CMPs.
Algorithm
Example
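The over-commit detection and swapping step can be sketched as follows. This is a simplified illustration of the idea on the slide, not the paper's global scheduler: `chips` is an assumed structure (a list of per-workload bandwidth demands per chip), and a single swap moves the hungriest workload off the most over-committed chip.

```python
def rebalance(chips, bw_limit):
    """Bandwidth over-commit sketch: if the busiest chip exceeds the
    bandwidth limit, swap its highest-bandwidth workload with the
    lightest workload on the most under-committed chip."""
    totals = [sum(c) for c in chips]
    over = max(range(len(chips)), key=lambda i: totals[i])
    under = min(range(len(chips)), key=lambda i: totals[i])
    if totals[over] <= bw_limit or over == under:
        return chips  # nothing over-committed; schedule unchanged
    hi = max(range(len(chips[over])), key=lambda j: chips[over][j])
    lo = min(range(len(chips[under])), key=lambda j: chips[under][j])
    # Migrate the heavy workload to the under-committed chip.
    chips[over][hi], chips[under][lo] = chips[under][lo], chips[over][hi]
    return chips
```

In a full system this step would repeat each epoch, using the profilers' projections rather than measured demands.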
Resource Management Scheme
Results
 LLC misses: 25.7% average reduction from static-even partitions (with an associated 1.4% storage overhead).
 The BW-aware algorithm shows improvement up to an 8-CMP implementation (beyond that, diminishing returns).
 Miss rates are consistent across different cache sizes, with slight improvement due to the increased number of possible cache ways and hence more potential workload-swapping candidates.
Results
 Memory Bandwidth: reduction of the average worst-case chip memory bandwidth in the system (per epoch).
 The figure of merit used is the long memory latencies associated with over-committed memory bandwidth requirements by specific CMPs.
 The UCP+ algorithm (Marginal Utility / intra-chip) shows an average 19% improvement over static-even. (This also increases with the number of CMPs, due to random workload selection of the average worst-case bandwidth.)
Results
 Simulated Throughput: used to measure the effectiveness of the implementation.
  Case 1: use of only UCP+
  Case 2: addition of the inter-chip (workload swapping) BW-aware algorithm
 Case 1 shows 8.6% IPC and 15.3 MPKI improvements on Chips 4 and 7 (swapping high-memory-bandwidth benchmarks for less demanding ones).
 Case 2 shows 8.5% IPC and 11% MPKI improvements due to workload migration off over-committed Chip 7.
Comments
 No detailed hardware implementation of the "non-invasive" profilers is given.
 "Large" CMP systems are not demonstrated, due to complexity.
 Good implementation of resource management, though:
  Design is limited (additional cores)
  Cache designs other than set-associative are not considered
References
 [1] D. Kaseridis, J. Stuecheli, J. Chen and L. K. John, "A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems".
 [2] R. L. Mattson, "Evaluation techniques for storage hierarchies". IBM Systems Journal, 9(2):78-117, 1970.
 [3] www.utdallas.edu/~edsha/parallel/2010S/Cache-Overview.pdf
Discussion Topics
 How can an inter-board partitioning algorithm be implemented? Is it necessary?
 What causes diminishing returns beyond 8 CMP chips? Can this be circumvented?