The Locality-Aware Adaptive Cache Coherence Protocol

George Kurian¹, Omer Khan², Srini Devadas¹
¹ Massachusetts Institute of Technology
² University of Connecticut, Storrs
Cache Hierarchy Organization
Directory-Based Coherence

• Private caches: 1 or 2 levels
• Shared cache: last level
• Concurrent reads lead to replication in private caches
• Directory maintains coherence for replicated lines

[Figure: a write miss in a private cache (1) is sent to the shared cache + directory (2), which invalidates the replicas at the sharers (3, 4) before granting the write; a sketch of this flow follows.]
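The numbered steps map onto a directory handler in a natural way. Below is a minimal, hypothetical sketch (all class and function names are ours, not from the protocol): reads add the requester to the sharer list, and a write miss invalidates every other sharer before granting ownership.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Hypothetical directory logic for the figure above. Network messages are
// stubbed out with prints; a real implementation would also collect the
// Inv-Replies (step 4) before granting ownership.
struct DirectoryEntry {
    std::vector<int> sharers;  // cores currently holding a replica
    bool dirty = false;
};

class Directory {
public:
    void handleReadMiss(uint64_t line, int reader) {
        entries_[line].sharers.push_back(reader);  // concurrent reads replicate
        std::printf("data reply (Shared) -> core %d\n", reader);
    }
    void handleWriteMiss(uint64_t line, int writer) {              // step 1
        DirectoryEntry& e = entries_[line];                        // step 2
        for (int s : e.sharers)
            if (s != writer)
                std::printf("Inv[%#llx] -> core %d\n",             // step 3
                            (unsigned long long)line, s);
        e.sharers = {writer};  // after the acks (step 4), writer is sole owner
        e.dirty = true;
        std::printf("data reply (Modified) -> core %d\n", writer);
    }
private:
    std::unordered_map<uint64_t, DirectoryEntry> entries_;
};

int main() {
    Directory dir;
    dir.handleReadMiss(0x1000, 0);   // cores 0 and 2 replicate the line
    dir.handleReadMiss(0x1000, 2);
    dir.handleWriteMiss(0x1000, 1);  // core 1's write invalidates both replicas
}
```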
Private Caching
Advantages & Drawbacks

☺ Exploits spatio-temporal locality
☺ Efficient low-latency local access to private + shared data (cache line replication)
☹ Inefficiently handles data with LOW spatio-temporal locality
  – Working set > private cache size → inefficient cache utilization (cache thrashing), unnecessary fetches of entire cache lines; shared data replication increases the working set
  – Shared data with frequent writes → wasteful invalidations, synchronous writebacks, cache-line ping-ponging
  – Both increase on-chip communication and the time spent waiting for expensive events
On-Chip Communication Problem

• Wires relative to gates are getting worse every generation (Bill Dally, Stanford)
• Bit movement is much more expensive than computation (Shekhar Borkar, Intel)
• Must architect efficient coherence protocols
Locality of Benchmarks
Evaluating Reuse before Evictions

• Utilization: # private L1 cache accesses before the cache line is evicted
• 40% of evicted lines have a utilization < 4

[Chart: distribution of line utilization at eviction (annotated 80% / 20%).]
Locality of Benchmarks
Evaluating Reuse before Invalidations

• Utilization: # private L1 cache accesses before the cache line is invalidated (intervening write)

[Chart: distribution of line utilization at invalidation (annotated 80% / 10%).]
Remote-Word Access (RA)
NUCA-based protocol [Fensch et al HPCA'08] [Hoffmann et al HiPEAC'10]

• Assign each memory address to a unique "home" core
  – The cache line is present only in the shared cache at the "home" core (single location)
• For access to a non-locally cached word, request the "remote" shared cache on the "home" core to perform the read/write access

[Figure: a write to a word is sent to the home core (1), which performs the access in its shared-cache slice and replies (2); a sketch of both sides follows.]
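A minimal sketch of the remote-access path under the same hypothetical naming: the requester ships the word-level operation to the home core instead of fetching the line, and the home core performs it directly in its shared-cache slice. The network round trip is modeled as a direct call here.

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Home core's shared-L2 slice, reduced to a word-addressed map for the sketch.
std::unordered_map<uint64_t, uint64_t> sharedSlice;

// Home-core side: perform the access in place, so no replica (and hence no
// later invalidation or writeback) is ever created at the requester.
uint64_t serveRemoteRequest(uint64_t addr, bool isWrite, uint64_t data) {
    if (isWrite) { sharedSlice[addr] = data; return 0; }
    return sharedSlice[addr];
}

// Requester side: ship the word-level operation to the home core (step 1)
// and wait for the word / ack reply (step 2).
uint64_t remoteAccess(int homeCore, uint64_t addr, bool isWrite,
                      uint64_t data = 0) {
    std::printf("word request %#llx -> home core %d\n",
                (unsigned long long)addr, homeCore);
    return serveRemoteRequest(addr, isWrite, data);
}

int main() {
    remoteAccess(3, 0x2000, /*isWrite=*/true, 42);  // remote word write
    std::printf("remote read: %llu\n",
                (unsigned long long)remoteAccess(3, 0x2000, false));
}
```

Because no replica is ever created, there is nothing to invalidate or write back for this core, which is exactly what the ☺ items on the next slide capture.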
Remote-Word Access
Advantages & Drawbacks

☺ Energy efficient for low-locality data → a word access (~200 bits) is cheaper than a cache line fetch (~640 bits); a back-of-envelope follows
☺ NO data replication → efficient private cache utilization
☺ NO invalidations / synchronous writebacks
☹ Round-trip network request for every remote word access
☹ Expensive for high-locality data
☹ Data placement dictates the distance & frequency of remote accesses
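A rough back-of-envelope consistent with the ~200-bit and ~640-bit figures above, assuming a 64-byte cache line, 64-bit words, and a fixed ~128-bit message header (the header size is our assumption, not from the slides):

```cpp
#include <cstdio>

int main() {
    const int kHeaderBits = 128;      // assumed header (address, type, routing)
    const int kWordBits   = 64;       // one 8-byte word
    const int kLineBits   = 64 * 8;   // one 64-byte cache line = 512 bits

    int wordMsg = kHeaderBits + kWordBits;  // ~192 bits, close to the ~200 quoted
    int lineMsg = kHeaderBits + kLineBits;  // 640 bits, matching the slide
    std::printf("word access: %d bits, line fetch: %d bits (%.1fx)\n",
                wordMsg, lineMsg, (double)lineMsg / wordMsg);
}
```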
Locality-Aware Cache Coherence

• Combine the advantages of private caching and remote access
• Privately cache high-locality lines
  – Optimize hit latency and energy
• Remotely cache low-locality lines
  – Prevent data replication & costly data movement
• Private Caching Threshold (PCT)
  – Utilization >= PCT → mark as private
  – Utilization < PCT → mark as remote
  (the classification rule is sketched after the mode-transition diagram below)
Locality-Aware Cache Coherence

• Private Caching Threshold (PCT) = 4

[Chart: breakdown of invalidations (%) by line utilization, in buckets 1; 2,3; 4,5; 6,7; >=8. Lines with utilization below PCT are classified remote; the rest remain private.]
Outline
• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion
Baseline System

[Figure: a tile containing the compute pipeline, private L1 I- and D-caches, a shared L2 cache slice with integrated directory, and a network router.]

• Compute pipeline
• Private L1-I and L1-D caches
• Logically shared, physically distributed L2 cache with integrated directory
• L2 cache managed by Reactive-NUCA [Hardavellas – ISCA09]
• ACKwise limited-directory protocol [Kurian – PACT10]
Locality-Aware Coherence
Important Features
• Intelligent allocation of cache lines
– In the private L1 cache
– Allocation decision made per-core at cache line level
• Efficient locality tracking hardware
– Decoupled from traditional coherence tracking
structures
• Protocol complexity low
– NO additional networks for deadlock avoidance
Implementation Details
Private Cache Line Tag

[ State | LRU | Tag | Private Utilization ]

• Private Utilization bits track cache line usage in the L1 cache (a possible encoding is sketched below)
• Communicated back to the directory on eviction or invalidation
• Storage overhead is only 0.4%
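One plausible encoding of the extended tag; the field widths are illustrative assumptions, chosen so the utilization counter costs about 2 bits per 64-byte line (2/512 ≈ 0.4%, matching the quoted overhead):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical layout of the extended private-cache tag (widths are ours).
struct L1TagEntry {
    uint64_t tag   : 40;              // address tag (width machine-dependent)
    uint64_t state : 3;               // coherence state
    uint64_t lru   : 2;               // replacement metadata
    uint64_t privateUtilization : 2;  // saturating access counter since fill

    void onHit() {
        if (privateUtilization < 3) ++privateUtilization;  // saturate at 3
    }
    // On eviction or invalidation, privateUtilization travels back to the
    // directory in the eviction / Inv-Reply message.
};

int main() {
    L1TagEntry e{};
    for (int i = 0; i < 5; ++i) e.onHit();
    std::printf("utilization at eviction: %u\n", (unsigned)e.privateUtilization);
}
```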
Implementation Details
Directory Entry

[ State | ACKwise Pointers 1…p | Tag | P/R₁, Remote Utilization₁ | … | P/Rₙ, Remote Utilizationₙ ]

• P/Rᵢ: Private/Remote mode for core i
• Remote Utilizationᵢ: line usage by core i at the shared L2 cache
• Complete locality classifier: tracks mode / remote utilization for all cores (a possible encoding is sketched below)
• Storage overhead reduced later
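A hypothetical encoding of this entry, with the per-core locality slots that make the complete classifier expensive (the core and pointer counts are assumptions; the evaluation uses 64 cores):

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

constexpr int kNumCores = 64;  // assumption: 64-core target as in the evaluation
constexpr int kAckwiseP = 4;   // assumption: ACKwise_p pointer count

struct PerCoreLocality {
    bool    isPrivate = true;       // P/R_i: every core starts in private mode
    uint8_t remoteUtilization = 0;  // accesses served at the shared L2 cache
};

// Hypothetical directory entry for the complete locality classifier.
struct DirectoryEntryComplete {
    uint64_t tag = 0;
    uint8_t  state = 0;
    std::array<uint16_t, kAckwiseP> ackwisePointers{};  // limited sharer list
    std::array<PerCoreLocality, kNumCores> locality{};  // one slot per core
};

int main() {
    // The per-core slots dominate the entry, which is what motivates the
    // limited classifier on the later slides.
    std::printf("entry size: %zu bytes (locality slots: %zu bytes)\n",
                sizeof(DirectoryEntryComplete),
                sizeof(std::array<PerCoreLocality, kNumCores>));
}
```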
Mode Transitions Summary

• Classification based on previous behavior

[State diagram:]
  Initial → Private (all cores start in private mode)
  Private → Private  when Private Utilization >= PCT
  Private → Remote   when Private Utilization < PCT
  Remote  → Private  when Remote Utilization >= PCT
  Remote  → Remote   when Remote Utilization < PCT
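The diagram collapses to a single threshold check applied whenever the directory learns a fresh utilization value. A minimal sketch (names are ours): private utilization arrives with an eviction or Inv-Reply, and remote utilization accumulates at the shared L2 cache.

```cpp
#include <cstdint>
#include <cstdio>

enum class Mode { Private, Remote };

// Hypothetical re-classification hook at the directory. Both arms of the
// diagram reduce to the same check: the next mode is private iff the last
// observed utilization reached PCT.
Mode reclassify(uint8_t utilization, uint8_t pct) {
    return utilization >= pct ? Mode::Private : Mode::Remote;
}

int main() {
    const uint8_t pct = 2;  // the walk-through example below uses PCT = 2
    std::printf("utilization 1 -> %s\n",
                reclassify(1, pct) == Mode::Private ? "private" : "remote");
    std::printf("utilization 2 -> %s\n",
                reclassify(2, pct) == Mode::Private ? "private" : "remote");
}
```

The private and remote arms only diverge once the Remote Access Threshold optimization replaces PCT on the remote side (later slides).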
Walk Through Example

• Private Caching Threshold PCT = 2
• Cores A, B, and C (each a pipeline + L1 cache) connect over the network to core D, which holds the shared L2 cache slice + directory for line X
• All cores start out in private mode; X is initially uncached

1. Core A issues Read[X]. Since A is in private mode, the directory responds with the whole cache line; A caches X (Shared, private utilization 1). Directory state: Shared, clean, Core A cached.
2. Core C issues Read[X] and likewise receives and caches the line (Shared, utilization 1).
3. Core C reads X again, hitting in its L1 cache; its private utilization rises to 2.
4. Core B issues Write[X]. The directory sends Inv[X] to the sharers A and C.
5. Core A invalidates its copy and returns Inv-Reply[X] with utilization 1. Since 1 < PCT, the directory reclassifies Core A as a remote sharer.
6. Core C invalidates and returns Inv-Reply[X] with utilization 2. Since 2 >= PCT, Core C stays in private mode. X is now uncached.
7. The directory sends the cache line to Core B, which caches it (Modified, utilization 1).
8. Core A issues Read[X]. A is now in remote mode, so the request goes to the directory, which sends WB[X] to Core B; B downgrades to Shared and returns WB-Reply[X] with the data (directory: Shared, dirty).
9. The directory performs the read at the shared L2 cache and returns only Word[X] to Core A; A's remote utilization becomes 1.
10. Core B issues Write[X]. The directory returns UpgradeReply[X]; B moves to Modified (utilization 2), and Core A's remote utilization is reset to 0 by the intervening write.
11. Core A issues Read[X]. The directory downgrades B to Shared again, performs the read, and returns Word[X]; A's remote utilization becomes 1.
12. Core A issues another Read[X], served again as a remote word access; A's remote utilization reaches 2.
13. Since 2 >= PCT, the directory promotes Core A back to private mode and replies with the whole cache line, carrying over the utilization of 2; A caches X (Shared, utilization 2).
Outline

• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion
Complete Locality Classifier
High Directory Storage

[ State | ACKwise Pointers 1…p | Tag | P/R₁, Remote Utilization₁ | … | P/Rₙ, Remote Utilizationₙ ]

• Complete locality classifier: tracks locality information for all cores

Classifier                           Complete
Bit overhead per core (256 KB L2)    192 KB (60%)
Limited Locality Classifier
Reduces Directory Storage

[ State | ACKwise Pointers 1…p | Tag | Core ID₁, P/R₁, Remote Utilization₁ | … | Core IDₖ, P/Rₖ, Remote Utilizationₖ ]

• Utilization and mode tracked for only k sharers
• Modes of the other sharers obtained by taking a majority vote (sketched below)
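A hypothetical lookup implementing the majority vote; the tie-break toward private mode is our assumption, as the slides do not specify it:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <optional>

constexpr int kTracked = 3;  // Limited-3: hardware slots for 3 cores

struct TrackedSharer {
    std::optional<uint16_t> coreId;  // empty if the slot is unused
    bool isPrivate = true;           // P/R_i for the tracked core
    uint8_t remoteUtilization = 0;
};

// A tracked core uses its own slot (exact information); an untracked core
// adopts the majority mode of the tracked cores.
bool isPrivateMode(uint16_t core,
                   const std::array<TrackedSharer, kTracked>& slots) {
    int privateVotes = 0, votes = 0;
    for (const auto& s : slots) {
        if (!s.coreId) continue;
        if (*s.coreId == core) return s.isPrivate;  // exact per-core info
        ++votes;
        if (s.isPrivate) ++privateVotes;
    }
    return votes == 0 || 2 * privateVotes >= votes;  // majority vote
}

int main() {
    std::array<TrackedSharer, kTracked> slots{{{5, true}, {9, false}, {12, false}}};
    std::printf("core 9 (tracked):    %d\n", isPrivateMode(9, slots));   // own slot
    std::printf("core 33 (untracked): %d\n", isPrivateMode(33, slots));  // majority: remote
}
```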
Limited-3 Locality Classifier

• Utilization and mode tracked for 3 sharers
• Achieves the performance and energy of the complete locality classifier:

Metric            Limited-3 vs Complete
Completion Time   3% lower
Energy            1.5% lower

• Completion time and energy are slightly lower because remote-mode classification is learned faster with Limited-3

Classifier                           Complete        Limited-3
Bit overhead per core (256 KB L2)    192 KB (60%)    18 KB (5.7%)
Private <-> Remote Transition
Results in Private Cache Thrashing

• It is difficult to measure the private-cache locality of a line while it resides in the shared L2 cache

[State diagram: Initial → Private; Private → Private when Private Utilization >= PCT; Private → Remote when Private Utilization < PCT; Remote → Private when Remote Utilization >= PCT; Remote → Remote when Remote Utilization < PCT]

• A core reverts back to private mode after #PCT accesses to a cache line at the shared L2 cache
• Each such promotion evicts other lines in the private L1 cache
• This results in low spatio-temporal locality for all such lines
Ideal Classifier
NO Private Cache Thrashing

• An ideal classifier maintains part of the working set in the private cache
• Other lines are placed in remote mode at the shared cache
Remote Access Threshold
Reduces Private Cache Thrashing

• If a core is classified as a remote sharer due to capacity, increase the cost of promotion to private mode
• If a core is classified as a private sharer, reset the cost back to its starting value
• This reduces private cache thrashing to a negligible level

[State diagram: Initial → Private; Private → Private when Private Utilization >= PCT; Private → Remote when Private Utilization < PCT; Remote → Private when Remote Utilization >= RAT; Remote → Remote when Remote Utilization < RAT]

• The Remote Access Threshold (RAT) is varied based on PCT & application behavior [details in paper]; a sketch of the bookkeeping follows
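A hypothetical sketch of the per-sharer RAT bookkeeping; the doubling schedule and the cap are illustrative assumptions (the slides only say RAT varies with PCT and application behavior, with details in the paper):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

struct RatState {
    uint8_t rat;      // current remote-access threshold
    uint8_t ratBase;  // starting value (a function of PCT)
    uint8_t ratMax;   // upper bound on the threshold

    // Thrashing suspected: make promotion back to private mode harder.
    void onDemotedForCapacity() { rat = std::min<int>(ratMax, rat * 2); }
    // Core proved its locality: reset the promotion cost.
    void onPromotedToPrivate()  { rat = ratBase; }
    // Remote -> Private transition now checks RAT instead of PCT.
    bool shouldPromote(uint8_t remoteUtilization) const {
        return remoteUtilization >= rat;
    }
};

int main() {
    RatState s{/*rat=*/2, /*ratBase=*/2, /*ratMax=*/16};
    s.onDemotedForCapacity();  // threshold 2 -> 4
    std::printf("promote at remote utilization 2? %d (RAT=%u)\n",
                s.shouldPromote(2), s.rat);
}
```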
Outline

• Motivation for Locality-Aware Coherence
• Implementation Details
• Optimizations
• Evaluation
• Conclusion
Reducing Capacity Misses
Private L1 Cache Miss Rate vs PCT (Blackscholes)

[Chart: L1 miss-rate breakdown (%) into Cold, Capacity, Upgrade, Sharing, and Word misses, for PCT = 1 to 8.]

• Miss rate drops as PCT increases (better utilization)
• Multiple capacity misses (expensive) are replaced with single word accesses (cheap)
• The miss rate rises again at the highest PCT values (one capacity miss turns into multiple word misses)
Energy vs PCT
Blackscholes

[Chart: normalized energy vs. PCT (1 to 8), broken down into Network Link, Network Router, Directory, L2 Cache, L1-D Cache, and L1-I Cache.]

• Fewer L1 cache misses (and capacity misses turning into word accesses) lead to less network traffic and fewer L2 accesses
• Accessing a word (~200 bits) is cheaper than fetching the entire cache line (~640 bits)
Completion Time vs PCT
Blackscholes

[Chart: normalized completion time vs. PCT (1 to 8), broken down into Synchronization, L2Cache-OffChip, L2Cache-Sharers, L2Cache-Waiting, L1Cache-L2Cache, and Compute.]

• Lower L1 cache miss rate + lower miss penalty
• Less time spent waiting on L1 cache misses
Reducing Sharing Misses
Private L1 Cache Miss Rate vs PCT (Streamcluster)

[Chart: L1 miss-rate breakdown (%) into Cold, Capacity, Upgrade, Sharing, and Word misses, for PCT = 1 to 8.]

• Sharing misses (expensive) turn into word misses (cheap) as PCT increases
Energy vs PCT
Streamcluster

[Chart: normalized energy vs. PCT (1 to 8), with the same breakdown as before.]

• Reduced invalidations, asynchronous writebacks, and cache-line ping-ponging
Completion Time vs PCT
Streamcluster

[Chart: normalized completion time vs. PCT (1 to 8), with the same breakdown as before.]

• Less time spent on invalidations and by loads waiting for previous stores
• Critical-section time reduction leads to synchronization time reduction
Variation with PCT
Results Summary

• Evaluated 18 benchmarks from the SPLASH-2, PARSEC, Parallel MiBench, and UHPC suites + 3 hand-written benchmarks
• A PCT of 4 obtains a 25% reduction in energy and a 15% reduction in completion time
• Evaluations done using the Graphite simulator for 64 cores, McPAT/CACTI cache energy models, and DSENT network energy models at 11 nm
Conclusion

• Three potential advantages of the locality-aware adaptive cache coherence protocol
  – Better private cache utilization
  – Reduced on-chip communication (invalidations, asynchronous write-backs, and cache-line transfers)
  – Reduced memory access latency and energy
• Efficient locality tracking hardware
  – Decoupled from traditional coherence tracking structures
  – Limited-3 locality classifier has a low overhead of 18 KB per core (with a 256 KB per-core L2 cache)
• Simple to implement
  – NO additional networks for deadlock avoidance