Cooperative Caching for Chip Multiprocessors

Jichuan Chang and Gurindar S. Sohi
University of Wisconsin-Madison
Review of CPU caching
• Block size
• Direct-mapped vs. n-way set associative vs. fully associative (see the address sketch after this list)
• Multiple layers (L1, L2, L3, …)
• Harvard Architecture (Data/Instruction L1)
• Three C’s (Compulsory, Capacity, Conflict)
• Write policy (write through, write back)
• Use of virtual vs. physical addresses
• Replacement policies (LRU, random, etc.)
• Special types (TLB, Victim)
• Prefetching
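
As a quick refresher, here is a minimal sketch (not from the paper; the block size, capacity, and associativity are invented) of how an address splits into tag, set index, and block offset:

    # Sketch: splitting an address into tag / set index / block offset.
    # Cache parameters are illustrative: 1 MB, 4-way, 64 B blocks.
    BLOCK_SIZE = 64                     # bytes per block
    NUM_SETS = (1 << 20) // (64 * 4)    # capacity / (block size * ways) = 4096
    OFFSET_BITS = (BLOCK_SIZE - 1).bit_length()   # 6
    INDEX_BITS = (NUM_SETS - 1).bit_length()      # 12

    def decompose(addr):
        offset = addr & (BLOCK_SIZE - 1)
        index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        return tag, index, offset

    print(decompose(0x402A48))   # -> (16, 169, 8)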
Review of Cache Coherency
• Snooping vs. directory-based
• MSI, MESI, MOSI, MOESI (a toy MESI sketch follows this list)
• Can often transfer dirty data from cache to cache
• Clean data is more difficult because it does not have an “owner”
• Inclusion vs. exclusion
• Performance issues related to coherency
• Mutex example
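
To make the states concrete, here is a toy sketch of MESI transitions for one block in one cache. The event names and the simplifications are mine; a real protocol also generates bus transactions, while this only tracks state:

    # Toy MESI state machine for a single block in a single cache.
    M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

    def next_state(state, event, others_have_copy=False):
        if event == "local_read":
            if state == I:                       # read miss: fill the block
                return S if others_have_copy else E
            return state                         # M/E/S read hits stay put
        if event == "local_write":
            return M                             # become the dirty owner
        if event == "bus_read":                  # another core reads the block
            return S if state in (M, E) else state   # M also supplies the data
        if event == "bus_write":                 # another core writes it
            return I                             # our copy is invalidated
        return state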
Chip Multiprocessors (CMP)
• Multiple CPU cores on a single chip
• Different from hardware multithreading (MT)
• Fine-grained, coarse-grained, SMT
• Becoming popular in industry with Intel Core 2 Duo, AMD X2, UltraSPARC T1, IBM Xenon, Cell
• A common memory architecture is an L1 cache per core and a shared L2 for all cores on chip
• Each core can use the entire L2 cache
• Another organization is a private L2 cache per core (an AMAT comparison follows this list)
• Lower latency to the L2 cache and a simpler design
• L2 cache contention can become a problem for memory-bound threads
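
A back-of-the-envelope average-memory-access-time (AMAT) comparison of the two organizations; every latency and hit rate below is invented purely for illustration:

    # AMAT = L1 time + L1 miss rate * (L2 time + L2 miss rate * memory time)
    def amat(l1_time, l1_hit, l2_time, l2_hit, mem_time):
        return l1_time + (1 - l1_hit) * (l2_time + (1 - l2_hit) * mem_time)

    # Private L2: faster to reach but smaller, so it misses more often.
    print(amat(l1_time=2, l1_hit=0.95, l2_time=10, l2_hit=0.70, mem_time=300))  # ~7.0
    # Shared L2: slower to reach, but each core can use the whole capacity.
    print(amat(l1_time=2, l1_hit=0.95, l2_time=18, l2_hit=0.85, mem_time=300))  # ~5.15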
Goals of CMP Caching
• Must be scalable!!!
• Reduce off-chip transactions
• Expensive and getting worse
• Reduce side effects between cores
• They are running different computations and should not severely affect their neighbors if memory-bound
• Reduce latency
• The main goal of caching
• Latency of a shared on-chip cache becomes a problem at high clock speeds
Cooperative Caching
• Each core has its own private L2 cache, but additional logic in the cache controller protocol allows the private L2 caches to act as an aggregate cache for the chip
• The goal is to achieve both the low latency of private L2 caches and the low off-chip miss rate of shared L2 caches
• Adapted from file server and web caches (where remote operations are expensive)
Methods of Cooperative Caching
• Private L2 caches are the baseline
• Reduce off-chip accesses
• Victim data does not get written off chip
• It is placed in a neighbor’s cache instead (capacity stealing; see the sketch below)
• Did not apply to old SMP systems
– Talking to a neighbor processor was as expensive as talking to memory
– Not true for CMP
• Can dynamically control the amount of cooperation
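
A minimal sketch of the eviction path under capacity stealing. The random host choice is a placeholder for the weighted selection described on a later slide, and the names are mine:

    import random

    def write_back_to_memory(block):
        pass   # placeholder for an off-chip transaction

    # Sketch: spill an evicted victim into a neighbor's private L2
    # instead of letting it leave the chip.
    def evict(victim, my_id, caches, cooperate=True):
        if cooperate:
            host = random.choice([c for i, c in enumerate(caches) if i != my_id])
            host.insert(victim)           # the block survives on chip
        elif victim.dirty:
            write_back_to_memory(victim)  # baseline private-L2 behavior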
Reducing off-chip accesses
• Cache-to-cache transfers of clean data
• Most cache coherence protocols do not allow this
• Dirty data can be transferred cache-to-cache because it has a known owner
• Clean data may be in more than one place, so assigning an owner to clean data complicates the coherence protocol (a directory sketch follows this list)
• The result is that clean-data transfers for coherence often go through the next level down (SLOW!)
• The extra complexity is worth it in a CMP because going to the next level of the memory hierarchy is expensive
• They claim sharing clean data can result in a 10-50% reduction in off-chip misses
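
A sketch of how a directory might track a clean owner so misses can be served cache-to-cache. The class and method names are assumptions, not the paper's design:

    # Sketch: the directory designates one sharer of a clean block as its
    # "owner", so a later read miss can be served cache-to-cache.
    class DirectoryEntry:
        def __init__(self):
            self.sharers = set()    # ids of caches holding a copy
            self.owner = None       # cache responsible for forwarding data

        def add_sharer(self, cache_id):
            self.sharers.add(cache_id)
            if self.owner is None:
                self.owner = cache_id          # first sharer becomes owner

        def evict_copy(self, cache_id):        # why evictions must be reported
            self.sharers.discard(cache_id)
            if self.owner == cache_id:         # hand ownership to another
                self.owner = next(iter(self.sharers), None)  # sharer, if any

        def serve_miss(self):
            return self.owner                  # None means go to memory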
Reducing off-chip accesses
• Replication-aware data replacement
• The private-cache method results in multiple copies of the same data
• When selecting a victim, picking something that has a duplicate on chip (a “replicate”) is better than picking something that is unique on the chip (a “singlet”)
• The reason is that if the singlet is needed again in the future, an off-chip access is required to get it back
• Must complicate the coherence protocol to keep track of replicates and singlets
– Once again, they claim it is worth it for CMP
• If all potential victims are singlets, they use LRU
• Victims can spill to a neighbor cache using a weighted random selection algorithm that favors nearest neighbors (sketched below)
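
A sketch of both pieces: replication-aware victim choice, then a distance-weighted random pick of the spill host. The weighting function is my guess at "favors nearest neighbors":

    import random

    # Victim choice: prefer the LRU replicate; only evict a singlet if the
    # whole set is singlets. cache_set is ordered LRU-first.
    def pick_victim(cache_set):
        replicates = [b for b in cache_set if not b.singlet]
        return replicates[0] if replicates else cache_set[0]

    # Spill host: weighted random choice favoring nearby cores.
    def pick_spill_host(my_id, num_cores):
        others = [c for c in range(num_cores) if c != my_id]
        weights = [1.0 / abs(c - my_id) for c in others]   # illustrative
        return random.choices(others, weights=weights, k=1)[0]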
Reducing off-chip accesses
• Global replacement of inactive data
• Want something like LRU for the aggregate cache
• Difficult to implement because each cache is technically private, leading to synchronization problems
• They use N-chance forwarding for the global replacement policy (bottom of page 4; sketched below)
– Each block has a recirculation count
– When a singlet block is selected, its recirculation count is set to N
– Each time the block is evicted, its recirculation count is decremented
– When the count reaches 0, the block goes to main memory
– When the block is accessed, its recirculation count is reset
– They use N = 1 for the CMP cooperative caching simulations
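
A sketch of N-chance forwarding as described above; the field and callback names are mine:

    N = 1   # one extra chance before a singlet leaves the chip

    # Sketch: the recirculation count bounds how many times a singlet can
    # be spilled before it is finally written back to main memory.
    def on_evict(block, spill, write_back):
        if block.singlet:
            if block.recirculation is None:
                block.recirculation = N    # first eviction: arm the counter
            if block.recirculation > 0:
                block.recirculation -= 1
                spill(block)               # forward to another on-chip cache
                return
        write_back(block)                  # count exhausted (or a replicate)

    def on_access(block):
        block.recirculation = None         # reuse resets the count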
Cooperative throttling
• Can modify the cooperation probability parameter in the spill algorithm to throttle the amount of cooperation (sketched below)
• Picks between the replication-aware method and the basic LRU method
• A lower cooperation probability means the caches act more like private L2s
• A higher cooperation probability means the caches act more like a shared L2
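
A sketch of throttling via the cooperation probability, reusing pick_victim from the earlier replication-aware sketch:

    import random

    # Sketch: probability p interpolates between private-L2 behavior
    # (p = 0, plain LRU) and fully cooperative behavior (p = 1).
    def choose_victim(cache_set, cooperation_probability):
        if random.random() < cooperation_probability:
            return pick_victim(cache_set)   # replication-aware, earlier sketch
        return cache_set[0]                 # plain LRU: list is LRU-first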
Hardware implementation
• Need extra bits to keep track of the state described on the previous slides (a sketch follows this list)
• Singlet bit
• Recirculation count
• Spilling method can be push or pull
• Push sends victim data directly to the other cache
• Pull sends a request to the other cache, which then performs a read operation
• Snooping requires too much overhead for monitoring private caches
• They choose a centralized directory-based protocol similar to MOSI
• Might have scaling issues
• They speculate that clusters of directories could scale to hundreds of cores, but do not go any deeper
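
A sketch of the per-block metadata these mechanisms add on top of the usual tag/valid/dirty state; the names and widths are assumptions (one bit suffices for the count when N = 1):

    from dataclasses import dataclass

    # Per-block state added by cooperative caching: a singlet bit plus a
    # recirculation count, alongside the usual tag/valid/dirty bits.
    @dataclass
    class BlockMeta:
        tag: int
        valid: bool = False
        dirty: bool = False
        singlet: bool = True     # no other on-chip copy known
        recirculation: int = 0   # N-chance forwarding counter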
Central Coherence Engine (CCE)
• Holds the directory and other centralized coherence logic
• Every read miss is sent to the directory, which says which private cache has the data
• Must keep track of both L1 and L2 tags due to non-inclusion between the L1 and the local L2 for each core (a sketch follows this list)
• Inclusion is between the L1 and the aggregate cache instead
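
A sketch of why the CCE duplicates both tag arrays: a block can sit in a core's L1 without being in that core's L2, so the directory must check both. The structure is illustrative, not the paper's hardware:

    # Sketch: the CCE keeps duplicate L1 and L2 tag sets per core and
    # answers read misses with the id of a cache that holds the block.
    class CCE:
        def __init__(self, num_cores):
            self.l1_tags = [set() for _ in range(num_cores)]
            self.l2_tags = [set() for _ in range(num_cores)]

        def find_holder(self, block_addr):
            for core, (l1, l2) in enumerate(zip(self.l1_tags, self.l2_tags)):
                if block_addr in l1 or block_addr in l2:
                    return core        # forward the request to this cache
            return None                # no on-chip copy: fetch from memory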
CCE (Continued)
• Picks a clean owner for a block to handle the cache-to-cache transfer of clean data
• The CCE must be updated when a clean copy is evicted from a private L2
• Implements push-based spilling by working with the private caches
• A write-back from one cache transfers the data to the CCE, which then picks a new host cache for the data and transfers it there
Performance evaluation
• Go over Section 4 of the paper
Related Work
• CMP-NUCA
• CMP-NuRAPID
• Victim Replication
Conclusion
• Cooperative caching can reduce the runtime of simulated workloads by 4-38%, and performs at worst 2.2% slower than the best of private and shared caches in extreme cases
• Power utilization (by turning off private L2s) and performance isolation (reducing side effects between cores) are left as future work