Optimizing Shared Caches in Chip Multiprocessors
Samir Sapra
Athula Balachandran
Ravishankar Krishnaswamy
Chip Multiprocessors??
“Just a few years ago, the idea of putting multiple processors
on a chip was farfetched. Now it is accepted and
commonplace, and virtually every new high performance
processor is a chip multiprocessor of some sort…”
-- Center for Electronic System Design, Univ. of California Berkeley
“Mowry is working on the development
of single-chip multiprocessors: one large
chip capable of performing multiple
operations at once, using similar
techniques to maximize performance”
-- Technology Review, 1999
Sony's PlayStation 3, 2006
Core 2 Duo die
CMP Caches: Design Space
• Architecture
– Placement of Cache/Processors
– Interconnects/Routing
• Cache Organization & Management
– Private/Shared/Hybrid
– Fully Hardware/OS Interface
“L2 is the last line of defense before hitting the
memory wall, and is the focus of our talk”
Private L2 Cache
[Figure: processors, each with L1 I$/D$ caches and its own private L2 $; a coherence protocol among the L2 caches; an interconnect to off-chip memory.]
+ Less interconnect traffic
+ Insulates L2 units
+ Lower hit latency
– Duplication
– Load imbalance
– Complexity of coherence
– Higher miss rate
Shared-Interleaved L2 Cache
[Figure: processors with L1 I$/D$ caches and a coherence protocol at the L1 level, connected through an interconnect to a shared L2.]
+ No duplication
+ Balance the load
+ Lower miss rate
+ Simplicity of coherence
– Interconnect traffic
– Interference between cores
– Hit latency is higher
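A minimal sketch of what "interleaved" means here, assuming the shared L2 is split into banks and consecutive cache blocks are assigned to banks round-robin by address. The block size, the bank count, and the name `home_bank` are illustrative assumptions, not values from the slides.

```python
BLOCK_BYTES = 64  # assumed cache-block size
NUM_BANKS = 4     # assumed number of shared L2 banks


def home_bank(addr: int) -> int:
    """Return the bank that holds the block containing physical address `addr`."""
    block_number = addr // BLOCK_BYTES
    # Consecutive blocks land in consecutive banks, which spreads the load,
    # but a core's data may sit in a far bank (the higher hit latency above).
    return block_number % NUM_BANKS


if __name__ == "__main__":
    for addr in range(0, 6 * BLOCK_BYTES, BLOCK_BYTES):
        print(hex(addr), "-> bank", home_bank(addr))
```

Because every block has exactly one home bank, no block is duplicated on chip, which is the "No duplication" advantage listed above.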
Take Home Messages
• Leverage on-chip access time
• Better sharing of cache resources
• Isolating performance of processors
• Place data on the chip close to where it is used
• Minimize inter-processor misses (in shared cache)
• Fairness towards processors
On to some solutions…
Jichuan Chang and Gurindar S. Sohi
Cooperative Caching for Chip Multiprocessors
International Symposium on Computer Architecture, 2006.
Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki
Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches
International Symposium on Computer Architecture, 2009.
Shekhar Srikantaiah, Mahmut Kandemir, and Mary Jane Irwin
Adaptive Set-Pinning: Managing Shared Caches in Chip Multiprocessors
Architectural Support for Programming Languages and Operating Systems, 2008.
Each handles this problem in a different way.
Co-operative Caching
(Chang & Sohi)
• Private L2 caches
• Attract data locally to reduce remote on-chip accesses.
This lowers the average on-chip access latency.
• Co-operation among the private caches for efficient
use of resources on the chip.
• Controlling the extent of co-operation to suit the
dynamic workload behavior
CC Techniques
• Cache to cache transfer of clean data
– On a miss, transfer “clean” blocks from another L2 cache.
– This is useful for “read only” data (instructions).
• Replication aware data replacement
– Singlet (the only on-chip copy) vs. replicate (one of several on-chip copies).
– Evict a singlet only when no replicates exist.
– Singlets can be “spilled” to other cache banks.
• Global replacement of inactive data
– Global management is needed for managing “spilling”.
– N-Chance Forwarding.
– Set the recirculation count to N when a block is first spilled.
– Decrease the count by 1 each time the block is spilled again; once it reaches 0, the block is not forwarded any further.
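The replacement and spilling rules above can be sketched as a toy model. This is an illustration under stated assumptions, not the mechanism from the paper: each private L2 is a small fully associative LRU cache, the host for a spilled block is chosen at random, and an eviction caused by an incoming spilled block does not spill again (assumed here to keep the cascade bounded). The class name `CooperativeCaches` and the constant `N_CHANCES` are hypothetical.

```python
import random
from collections import OrderedDict

N_CHANCES = 1  # assumed value of N: extra lives granted to a spilled singlet


class CooperativeCaches:
    """Toy model: one small fully associative LRU cache per core (hypothetical)."""

    def __init__(self, num_cores, capacity, seed=0):
        # Each cache maps block tag -> recirculation count (None = never spilled).
        self.caches = [OrderedDict() for _ in range(num_cores)]
        self.capacity = capacity
        self.rng = random.Random(seed)

    def holders(self, tag):
        """Indices of the private caches that currently hold `tag`."""
        return [i for i, cache in enumerate(self.caches) if tag in cache]

    def insert(self, core, tag, count=None, may_spill=True):
        cache = self.caches[core]
        if tag in cache:
            cache.move_to_end(tag)  # refresh the LRU position
            return
        if len(cache) >= self.capacity:
            self._evict(core, may_spill)
        cache[tag] = count

    def _evict(self, core, may_spill):
        cache = self.caches[core]
        # Replication-aware replacement: prefer the LRU replicate over any singlet.
        victim = next((t for t in cache if len(self.holders(t)) > 1), None)
        if victim is None:
            victim = next(iter(cache))  # no replicates: fall back to the LRU singlet
        count = cache.pop(victim)
        if self.holders(victim) or not may_spill:
            return  # another on-chip copy exists, or spilling is disabled here
        remaining = N_CHANCES if count is None else count - 1
        peers = [i for i in range(len(self.caches)) if i != core]
        if remaining <= 0 or not peers:
            return  # chances used up (or nowhere to go): the block leaves the chip
        host = self.rng.choice(peers)
        # The host may evict to make room, but that eviction does not spill again.
        self.insert(host, victim, count=remaining, may_spill=False)


if __name__ == "__main__":
    cc = CooperativeCaches(num_cores=2, capacity=1)
    cc.insert(0, "A")
    cc.insert(0, "B")        # A is a singlet, so it is spilled to core 1
    print(cc.holders("A"))
    cc.insert(1, "C")        # A's single chance is used up; it leaves the chip
    print(cc.holders("A"))
```

The sketch omits details a real design needs, such as resetting the count when a spilled block is reused.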
Set “Pinning” -- Setup
[Figure: processors P1–P4, each with an L1 cache, connected through an interconnect to a shared L2 cache (Set 0 to Set S-1), which is backed by main memory.]
Set “Pinning” -- Problem
[Figure: processors P1–P4 accessing Set 0 to Set S-1 of the shared L2 cache, which is backed by main memory.]
Set “Pinning” -- Types of Cache Misses
• Compulsory (aka Cold)
• Capacity
• Conflict
• Coherence
versus
• Compulsory
• Inter-processor
• Intra-processor
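The slides do not define the second classification, so the sketch below assumes the following reading: a miss is compulsory if the block was never cached, intra-processor if the block was last evicted by the same processor that now misses on it, and inter-processor if it was evicted by a different processor. The shared cache is modeled as a small fully associative LRU cache; `classify_misses` and its parameters are hypothetical names.

```python
from collections import OrderedDict


def classify_misses(trace, capacity):
    """Label each access in `trace`, a sequence of (processor_id, block_tag)."""
    cache = OrderedDict()  # resident blocks in LRU order
    evicted_by = {}        # block tag -> processor whose access evicted it
    seen = set()           # blocks that have been cached at least once
    labels = []
    for proc, tag in trace:
        if tag in cache:
            cache.move_to_end(tag)
            labels.append("hit")
            continue
        if tag not in seen:
            labels.append("compulsory")
        elif evicted_by[tag] == proc:
            labels.append("intra-processor")
        else:
            labels.append("inter-processor")
        seen.add(tag)
        if len(cache) >= capacity:
            victim, _ = cache.popitem(last=False)  # evict the LRU block
            evicted_by[victim] = proc
        cache[tag] = True
    return labels


if __name__ == "__main__":
    # P2's access to B evicts A, so P1's second access to A is an inter-processor miss.
    print(classify_misses([(1, "A"), (2, "B"), (1, "A")], capacity=1))
```

Under this reading, inter-processor misses are the ones that measure interference between cores in a shared cache.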
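[Figure: each set of the shared L2 carries an Owner field alongside its other bits and data; processors P1–P4 each have their own POP (POP 1 to POP 4); main memory sits behind the cache.]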
R-NUCA: Use Class-Based Strategies
Solve for the common case!
Most current (and future) programs have the following types of accesses:
1. Instruction Access – Shared, but Read-Only
2. Private Data Access – Read-Write, but not Shared
3. Shared Data Access – Read-Write or Read-Only, but Shared
R-NUCA: Can do this online!
• We have information from the OS and TLB
• For each memory block, classify it as
– Instruction
– Private Data
– Shared Data
• Handle them differently
– Replicate instructions
– Keep private data locally
– Keep shared data globally
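A minimal sketch of how such an online classification could work, assuming it is done at page granularity: instruction fetches are always classified as instructions, a data page is private to the first core that touches it, and it becomes shared once a second core touches it. The `PageClassifier` class and its method names are hypothetical, not an interface from the paper or any OS.

```python
from enum import Enum


class BlockClass(Enum):
    INSTRUCTION = "instruction"
    PRIVATE_DATA = "private data"
    SHARED_DATA = "shared data"


_SHARED = object()  # sentinel marking a page touched by more than one core


class PageClassifier:
    """Toy stand-in for the per-page bookkeeping the OS/TLB would provide."""

    def __init__(self):
        self.owner = {}  # data page number -> first core to touch it, or _SHARED

    def access(self, core, page, is_instruction_fetch):
        if is_instruction_fetch:
            return BlockClass.INSTRUCTION
        first = self.owner.setdefault(page, core)
        if first is not _SHARED and first != core:
            self.owner[page] = _SHARED  # a second core touched the page
        if self.owner[page] is _SHARED:
            return BlockClass.SHARED_DATA
        return BlockClass.PRIVATE_DATA


if __name__ == "__main__":
    pc = PageClassifier()
    print(pc.access(core=0, page=7, is_instruction_fetch=False))
    print(pc.access(core=1, page=7, is_instruction_fetch=False))
```

In this sketch a page never goes back from shared to private; whether and how to handle that is left out.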
R-NUCA: Reactive Clustering
• Assign clusters based on level of sharing
– Private Data given level-1 clusters (local cache)
– Shared Data given level-16 clusters (16 neighboring tiles), etc.
Clusters ≈ Overlapping Sets in Set-Associative Mapping
• Within a cluster, “Rotational Interleaving”
– Load-Balancing to minimize contention on bus and controller
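A simplified sketch of cluster-based placement, assuming a 16-slice chip. It uses fixed, aligned clusters with plain modulo interleaving as a stand-in; it is not the rotational interleaving from the paper, which lets clusters overlap. The function `target_slice`, the constant `NUM_SLICES`, and the intermediate cluster size of 4 in the demo are illustrative assumptions.

```python
NUM_SLICES = 16  # assumed number of L2 slices (one per tile)


def target_slice(core: int, block_number: int, cluster_size: int) -> int:
    """Pick the L2 slice for a block, given the requesting core's cluster size.

    cluster_size 1  -> private data stays in the requester's local slice.
    cluster_size 16 -> shared data is interleaved across all slices.
    """
    if cluster_size < 1 or NUM_SLICES % cluster_size != 0:
        raise ValueError("cluster_size must be a positive divisor of NUM_SLICES")
    # Stand-in for rotational interleaving: aligned clusters, modulo placement.
    base = (core // cluster_size) * cluster_size
    return base + block_number % cluster_size


if __name__ == "__main__":
    print(target_slice(core=5, block_number=1234, cluster_size=1))
    print(target_slice(core=5, block_number=35, cluster_size=16))
    print(target_slice(core=5, block_number=6, cluster_size=4))
```

With a cluster size of 16 the result does not depend on the requesting core, so each shared block has a single on-chip location and needs no replication.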
Future Directions
Area has been closed.
Just Kidding…
• Optimize for Power Consumption
• Assess trade-offs between more caches and more cores
• Minimize usage of OS, but still retain flexibility
• Application adaptation to allocated cache quotas
• Adding hardware-directed thread-level speculation
Questions?
THANK YOU!
Backup
• Commercial and research prototypes
– Sun MAJC
– Piranha
– IBM Power 4/5
– Stanford Hydra