Locality-aware Cache Hierarchy Management for Multicore Processors

by

George Kurian

B.Tech., Indian Institute of Technology, Chennai (2008)
S.M., Massachusetts Institute of Technology (2010)

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2015

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, December 30, 2014

Certified by: Srinivas Devadas, Professor, Thesis Supervisor

Accepted by: Leslie A. Kolodziejski, Chair, Department Committee on Graduate Students
Locality-aware Cache Hierarchy Management for Multicore
Processors
by
George Kurian
Submitted to the Department of Electrical Engineering and Computer Science
on December 30, 2014, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Abstract
Next generation multicore processors and applications will operate on massive data
with significant sharing. A major challenge in their implementation is the storage
requirement for tracking the sharers of data. The bit overhead for such storage
scales quadratically with the number of cores in conventional directory-based cache
coherence protocols.
Another major challenge is limited cache capacity and the data movement incurred
by conventional cache hierarchy organizations when dealing with massive data scales.
These two factors impact memory access latency and energy consumption adversely.
This thesis proposes scalable, efficient mechanisms that improve effective cache capacity (i.e., by improving utilization) and reduce data movement by exploiting locality
and controlling replication.
First, a limited directory-based protocol, ACKwise, is proposed to track the sharers of data in a cost-effective manner. ACKwise leverages broadcasts to implement
scalable cache coherence. Broadcast support can be implemented in a 2-D mesh network by making simple changes to its routing policy without requiring any additional
virtual channels.
Second, a locality-aware replication scheme that better manages the private caches
is proposed. This scheme controls replication based on data reuse information and
seamlessly adapts between private and logically shared caching of on-chip data at the
fine granularity of cache lines. A low-overhead runtime profiling capability to measure
the locality of each cache line is built into hardware. Private caching is only allowed
for data blocks with high spatio-temporal locality.
Third, a Timestamp-based memory ordering validation scheme is proposed that
enables the locality-aware private cache replication scheme to be implementable in
processors with out-of-order memory that employ popular memory consistency models. This method does not rely on cache coherence messages to detect speculation
violations, and hence is applicable to the locality-aware protocol. The timestamp
mechanism is efficient due to the observation that consistency violations only occur
due to conflicting accesses that have temporal proximity (i.e., within a few cycles of
each other), thus requiring timestamps to be stored only for a small time window.
Fourth, a locality-aware last-level cache (LLC) replication scheme that better
manages the LLC is proposed. This scheme adapts replication at runtime based
on fine-grained cache line reuse information and thereby, balances data locality and
off-chip miss rate for optimized execution.
Finally, all the above schemes are combined to obtain a cache hierarchy replication
scheme that provides optimal data locality and miss rates at all levels of the cache
hierarchy. The design of this scheme is motivated by the experimental observation
that both locality-aware private cache & LLC replication enable varying performance
improvements across benchmarks.
These techniques enable optimal use of the on-chip cache capacity, and provide
low-latency, low-energy memory access, while retaining the convenience of shared
memory and preserving the same memory consistency model. On a 64-core multicore processor with out-of-order cores, Locality-aware Cache Hierarchy Replication
improves completion time by 15% and energy by 22% over a state-of-the-art baseline
while incurring a storage overhead of 30.7 KB per core (i.e., 10% of the aggregate cache
capacity of each core).
Thesis Supervisor: Srinivas Devadas
Title: Professor
Acknowledgments
I would like to thank my advisors at MIT, Prof. Srinivas Devadas and Prof. Anant
Agarwal for their enthusiastic support towards my research, both through technical
guidance and funding. I would like to thank Prof. Devadas for his sharp intellect and
witty demeanor which made group meetings fun and lively. I would like to thank him
for sitting patiently through my many presentations and providing useful feedback
that always made my talk better. I would also like to thank him for accepting to be
my mentor midway through my PhD career when Prof. Agarwal had to leave to join
EdX, and also for allowing me to explore many research topics before finally deciding
on one worthy of serious consideration.
I would like to thank Prof. Agarwal for his constant encouragement and lively
personality which puts anyone who meets with him immediately at ease. I would like
to thank him for all the presentation skills I have learnt from him which have enabled
me to plan the best use of my time during a talk. I would also like to thank him
for enabling me to get adjusted to MIT and USA during my initial days at graduate
school.
I would like to thank Prof. Omer Khan (University of Connecticut, Storrs) for
constantly providing fuel for my research and helping me remain competitive. I would
like to thank him for introducing me to many researchers at conferences. I would also
like to thank him for being my partner in crime during numerous all-nighters before
conference deadlines.
I would like to thank Prof. Daniel Sanchez for being a member of my thesis
committee and for all the feedback on the initial draft of my thesis. I would also like
to thank him for sharing the ZSim simulator code from which I have extracted the
processor decoder models used in my work.
I would like to thank Jason Miller for all the discussions I had with him and for
his willingness to explain technical details clearly, precisely and patiently. I would
also like to thank him for the valuable feedback he gave for many presentations and
for the many discussion sessions on the Graphite simulator. I would also like to thank
Nathan Beckmann, Harshad Kasture and Charles Gruenwald for the camaraderie and
all the brainstorming sessions during the Graphite project. I would like to thank the
Hornet group for all the informative presentations and discussions during the group
meetings.
I would like to thank Cree Bruins for all the support and for always being willing
to help when needed. I would like to thank my close set of friends at Tang Hall for
the friendship and support extended to me at MIT and for providing a venue for
relaxation and enjoyment outside of research and courses.
Last but not least, I would like to thank my parents and my brother for the
love and support extended to me throughout my life and for making me the person I
am today. I would like to thank them for supporting my decision to go to graduate
school and for tolerating my absence from home for most of the year.
Contents

1 Introduction  21
  1.1 Performance and Energy Efficiency  21
    1.1.1 Cache Management  22
    1.1.2 Single-core Performance  23
  1.2 Programmability and Memory Models  24
  1.3 Thesis Contributions  25
    1.3.1 ACKwise Directory Coherence Protocol  25
    1.3.2 Locality-aware Private Cache Replication [54]  26
    1.3.3 Timestamp-based Memory Ordering Validation  27
    1.3.4 Locality-aware LLC Replication [53]  28
    1.3.5 Locality-aware Cache Hierarchy Replication  29
    1.3.6 Results Summary  29
  1.4 Organization of Thesis  30

2 Evaluation Methodology  31
  2.1 Baseline Architecture  31
  2.2 Performance Models  32
  2.3 Energy Models  33
  2.4 Toolflow  34
  2.5 Application Benchmarks  34

3 ACKwise Directory Coherence Protocol  37
  3.1 Protocol Operation  38
  3.2 Silent Evictions  39
  3.3 Electrical Mesh Support for Broadcast  39
  3.4 Evaluation Methodology  41
  3.5 Results  41
    3.5.1 Sensitivity to Number of Hardware Pointers (k)  41
    3.5.2 Sensitivity to Broadcast Support  44
    3.5.3 Comparison to Dir_kNB [6] Organization  45
  3.6 Summary  46

4 Locality-aware Private Cache Replication  47
  4.1 Motivation  47
  4.2 Protocol Operation  48
    4.2.1 Read Requests  50
    4.2.2 Write Requests  52
    4.2.3 Evictions and Invalidations  52
  4.3 Predicting Remote-to-Private Transitions  53
  4.4 Limited Locality Classifier  55
  4.5 Selection of PCT  57
  4.6 Overheads of the Locality-Based Protocol  57
    4.6.1 Storage  57
    4.6.2 Cache & Directory Accesses  58
    4.6.3 Network Traffic  59
    4.6.4 Transitioning between Private/Remote Modes  60
  4.7 Simpler One-Way Transition Protocol  60
  4.8 Synergy with ACKwise  60
  4.9 Potential Advantages of Locality-Aware Cache Coherence  61
  4.10 Evaluation Methodology  61
    4.10.1 Evaluation Metrics  62
  4.11 Results  63
    4.11.1 Energy and Completion Time Trends  63
    4.11.2 Static Selection of PCT  68
    4.11.3 Tuning Remote Access Thresholds  68
    4.11.4 Limited Locality Tracking  69
    4.11.5 Simpler One-Way Transition Protocol  71
    4.11.6 Synergy with ACKwise  72
  4.12 Summary  73

5 Timestamp-based Memory Ordering Validation  75
  5.1 Principal Contributions  76
  5.2 Background: Out-of-Order Processors  78
    5.2.1 Load-Store Unit  79
    5.2.2 Lifetime of Load/Store Operations  80
    5.2.3 Precise State  81
  5.3 Background: Memory Models Specification  81
    5.3.1 Operational Specification  82
    5.3.2 Axiomatic Specification  83
  5.4 Timestamp-based Consistency Validation  84
    5.4.1 Simple Implementation of TSO Ordering  84
    5.4.2 Basic Timestamp Algorithm  85
    5.4.3 Finite History Queues  92
    5.4.4 In-Flight Transaction Timestamps  95
    5.4.5 Mixing Remote Accesses and Private Caching  97
    5.4.6 Parallel Remote Stores  99
    5.4.7 Overheads  101
    5.4.8 Forward Progress & Starvation Freedom Guarantees  103
  5.5 Discussion  103
    5.5.1 Other Memory Models  103
    5.5.2 Multiple Clock Domains  104
  5.6 Parallelizing Non-Conflicting Accesses  105
    5.6.1 Classification  105
    5.6.2 TLB Miss Handler  106
    5.6.3 Page Protection Fault Handler  107
    5.6.4 Discussion  107
    5.6.5 Combining with Timestamp-based Speculation  107
  5.7 Evaluation Methodology  108
    5.7.1 Performance Models  108
    5.7.2 Energy Models  109
  5.8 Results  109
    5.8.1 Comparison of Schemes  109
    5.8.2 Sensitivity to PCT  114
    5.8.3 Sensitivity to History Retention Period (HRP)  116
  5.9 Summary  117

6 Locality-aware LLC Replication Scheme  119
  6.1 Motivation  120
    6.1.1 Cache Line Reuse  120
    6.1.2 Cluster-level Replication  121
    6.1.3 Proposed Idea  124
  6.2 Locality-Aware LLC Data Replication  124
    6.2.1 Protocol Operation  125
    6.2.2 Limited Locality Classifier Optimization  130
    6.2.3 Overheads  132
  6.3 Discussion  133
    6.3.1 Replica Creation Strategy  133
    6.3.2 Coherence Complexity  133
    6.3.3 Classifier Organization  134
  6.4 Cluster-Level Replication  135
  6.5 Evaluation Methodology  137
    6.5.1 Baseline LLC Management Schemes  137
    6.5.2 Evaluation Metrics  138
  6.6 Results  139
    6.6.1 Comparison of Replication Schemes  139
    6.6.2 LLC Replacement Policy  144
    6.6.3 Limited Locality Classifier  145
    6.6.4 Cluster Size Sensitivity Analysis  146
  6.7 Summary  147

7 Locality-Aware Cache Hierarchy Replication  149
  7.1 Motivation  149
  7.2 Implementation  151
    7.2.1 Microarchitecture Modifications  151
    7.2.2 Protocol Operation  152
    7.2.3 Optimizations  156
    7.2.4 Overheads  156
  7.3 Evaluation Methodology  157
  7.4 Results  158
    7.4.1 PCT and RT Threshold Sweep  161
  7.5 Summary  163

8 Related Work  165
  8.1 Data Replication  165
  8.2 Coherence Directory Organization  169
  8.3 Selective Caching / Dead-Block Eviction  171
  8.4 Remote Access  173
  8.5 Data Placement and Migration  173
  8.6 Cache Replacement Policy  175
  8.7 Cache Partitioning / Cooperative Cache Management  176
  8.8 Memory Consistency Models  178
  8.9 On-Chip Network and DRAM Performance  178

9 Conclusion  181
  9.1 Thesis Contributions  182
  9.2 Future Directions  183
    9.2.1 Hybrid Software/Hardware Techniques  184
    9.2.2 Classifier Compression  184
    9.2.3 Optimized Variant of Non-Conflicting Scheme  184
List of Figures

2-1  Architecture of the baseline system. Each core consists of a compute pipeline, private L1 instruction and data caches, a physically distributed shared L2 cache with integrated directory and a network router.  31
3-1  Structure of an ACKwise_p coherence directory entry.  37
3-2  Broadcast routing on a mesh network from core B. X-Y dimension-order routing is followed.  39
3-3  Illustration of deadlock when using broadcasts on a 1-dimensional mesh network with X-Y routing.  40
3-4  Completion time of the ACKwise_k protocol when k is varied as 1, 2, 4, 8, 16 & 256. Results are normalized to the full-map protocol (k = 256).  42
3-5  Energy of the ACKwise_k protocol when k is varied as 1, 2, 4, 8, 16 & 256. Results are normalized to the full-map protocol (k = 256).  43
3-6  Completion Time of the ACKwise_k protocol when run on a mesh network without broadcast support. NB signifies the absence of broadcast support. k is varied as 1, 2, 4, 8 & 16. Results are normalized to the performance of the ACKwise_4 protocol on a mesh with broadcast support.  44
3-7  Completion time of the Dir_kNB protocol when k is varied as 2, 4, 8 & 16. Results are normalized to the ACKwise_4 protocol.  45
4-1  Invalidations vs Utilization.  47
4-2  Evictions vs Utilization.  48
4-3  1, 2 and 3 are mockup requests showing the two modes of accessing on-chip caches using our locality-aware protocol. Since the black data block has high locality with respect to the core at 1, the directory at the home-node hands out a private copy of the cache line. On the other hand, the low-locality red data block is always cached in a single location at its home-node, and all requests (2, 3) are serviced using roundtrip remote-word accesses.  49
4-4  Each cache line is initialized to Private with respect to all sharers. Based on the utilization counters that are updated on each memory access to this cache line and the parameter PCT, the sharers are transitioned between Private and Remote modes. Here utilization = (private + remote) utilization.  50
4-5  Each L1 cache tag is extended to include additional bits for tracking (a) private utilization, and (b) last-access time of the cache line.  50
4-6  ACKwise_p - Complete classifier directory entry. The directory entry contains the state, tag, ACKwise_p pointers as well as (a) mode (P/R), (b) remote utilization counters and (c) last-access timestamps for tracking the locality of all the cores in the system.  51
4-7  The limited locality classifier extends the directory entry with mode, utilization, and RAT-level bits for a limited number of cores. A majority vote of the modes of tracked cores is used to classify new cores as private or remote sharers.  55
4-8  Variation of Energy with PCT. Results are normalized to a PCT of 1. Note that Average and not Geometric-Mean is plotted here.  64
4-9  Variation of Completion Time with PCT. Results are normalized to a PCT of 1. Note that Average and not Geometric-Mean is plotted here.  64
4-10  L1 Data Cache Miss Rate and Miss Type Breakdown vs PCT. Note that in this graph, the miss rate increases from left to right as well as from top to bottom.  65
4-11  Variation of Geometric-Means of Completion Time and Energy with Private Caching Threshold (PCT). Results are normalized to a PCT of 1.  68
4-12  Remote Access Threshold sensitivity study for RAT_levels (L) and RAT_max (T).  68
4-13  Variation of Completion Time and Energy with the number of hardware locality counters (k) in the Limited_k classifier. Limited_64 is identical to the Complete classifier. Benchmarks for which results are not shown are identical to WATER-SP, i.e., the Completion Time and Energy stay constant as k varies.  70
4-14  Cache miss rate breakdown variation with the number of hardware locality counters (k) in the Limited_k classifier. Limited_64 is identical to the Complete classifier.  71
4-15  Ratio of Completion Time and Energy of Adapt_1-way over Adapt_2-way.  71
4-16  Synergy between the locality-aware coherence protocol and ACKwise. Variation of average and maximum sharer count during invalidations as a function of PCT.  72
5-1  Operational Specification of the TSO Memory Consistency Model.  82
5-2  Microarchitecture of a multicore tile. The orange colored modules are added to support the proposed modifications.  85
5-3  Structure of a load queue entry.  87
5-4  Structure of a store queue entry.  87
5-5  Structure of a page table entry.  106
5-6  Completion Time breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here.  110
5-7  Energy breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here.  111
5-8  Completion Time and Energy consumption as PCT varies from 1 to 16. Results are normalized to a PCT of 1 (i.e., Reactive-NUCA protocol).  115
5-9  Completion Time sensitivity to History Retention Period (HRP) as HRP varies from 64 to 4096.  116
6-1  Distribution of instructions, private data, shared read-only data, and shared read-write data accesses to the LLC as a function of run-length. The classification is done at the cache line granularity.  120
6-2  Distribution of the accesses to the shared L2 cache as a function of the sharing degree. The accesses are broken down into (1) Instruction, (2) Shared Read-Only (RO) Data, (3) Shared Read-Write (RW) Data Read and (4) Shared Read-Write (RW) Data Write.  122
6-3  Distribution of the accesses to the shared L2 cache as a function of the sharing degree. The accesses are broken down into (1) Instruction, (2) Shared Read-Only (RO) Data, (3) Shared Read-Write (RW) Data Read and (4) Shared Read-Write (RW) Data Write.  123
6-4  1-4 are mockup requests showing the locality-aware LLC replication protocol. The black data block has high reuse and a local LLC replica is allowed that services requests from 1 and 2. The low-reuse red data block is not allowed to be replicated at the LLC, and the request from 3 that misses in the L1 must access the LLC slice at its home core. The home core for each data block can also service local private cache misses (e.g., 4).  125
6-5  Each directory entry is extended with replication mode bits to classify the usefulness of LLC replication. Each cache line is initialized to non-replica mode with respect to all cores. Based on the reuse counters (at the home as well as the replica location) and the parameter RT, the cores are transitioned between replica and non-replica modes. Here XReuse is (Replica + Home) Reuse on an invalidation and Replica Reuse on an eviction.  126
6-6  ACKwise_p-Complete locality classifier LLC tag entry. It contains the tag, LRU bits and directory entry. The directory entry contains the state, ACKwise_p pointers, a Replica reuse counter as well as Replication mode bits and Home reuse counters for every core in the system.  127
6-7  ACKwise_p-Limited_k locality classifier LLC tag entry. It contains the tag, LRU bits and directory entry. The directory entry contains the state, ACKwise_p pointers, a Replica reuse counter as well as the Limited_k classifier. The Limited_k classifier contains a Replication mode bit and Home reuse counter for a limited number of cores. A majority vote of the modes of tracked cores is used to classify new cores as replicas or non-replicas.  130
6-8  Energy breakdown for the LLC replication schemes evaluated. Results are normalized to that of S-NUCA. Note that Average and not Geometric-Mean is plotted here.  140
6-9  Completion Time breakdown for the LLC replication schemes evaluated. Results are normalized to that of S-NUCA. Note that Average and not Geometric-Mean is plotted here.  141
6-10  L1 Cache Miss Type breakdown for the LLC replication schemes evaluated.  142
6-11  Energy and Completion Time for the Limited_k classifier as a function of number of tracked sharers (k). The results are normalized to that of the Complete (= Limited_64) classifier.  145
6-12  Energy and Completion Time at cluster sizes of 1, 4, 16 and 64 with the locality-aware data replication protocol. A cluster size of 64 is the same as R-NUCA except that it does not even replicate instructions.  147
7-1  The red block prefers to be in the 1st mode, being replicated only at the L1-I/L1-D cache. Cores N and P access data directly at the L1 cache. The blue block prefers to be in the 2nd mode, being replicated at both the L1 & L2 caches. Cores O and P access data directly at the L2 cache. The violet block prefers to be in the 3rd mode, and is accessed remotely at the L2 home location without being replicated in either the L1 or the L2 cache. Cores P and Q access data using remote-word requests at the L2 home location. And finally, the green block prefers the 4th mode, being replicated at the L2 cache and accessed using word accesses at the L2 replica location. Cores Q and U access data using remote-word requests.  150
7-2  Modifications to the L2 cache line tag. Each cache line is augmented with a Private Reuse counter that tracks the number of times a cache line has been accessed at the L2 replica location. In addition, each cache line tag has classifiers for deciding whether or not to replicate lines in the L1 cache & L2 cache. Both the L1 & L2 classifiers contain the Mode, Home Reuse, and RAT-Level fields that serve to remember the past locality information for each cache line. The above information is only maintained for a limited number of cores, k, and the mode of untracked cores is obtained by a majority vote.  152
7-3  Completion Time breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here.  159
7-4  Energy breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here.  160
7-5  Variation of Completion Time as a function of PCT & RT. The Geometric-Mean of the completion time obtained from all benchmarks is plotted.  162
7-6  Variation of Energy as a function of PCT & RT. The Geometric-Mean of the energy obtained from all benchmarks is plotted.  162
List of Tables

2.1  Architectural parameters used for evaluation  33
2.2  Projected Transistor Parameters for 11 nm Tri-Gate  34
2.3  Problem sizes for our parallel benchmarks.  35
4.1  Cache sizes per core.  58
4.2  Storage required for the caches and coherence protocol per core.  58
4.3  Locality-aware cache coherence protocol parameters  61
5.1  Timestamp-based speculation violation detection & locality-aware private cache replication scheme parameters.  108
6.1  Locality-aware LLC (L2) replication parameters  137
7.1  Locality-aware protocol & timestamp-based speculation violation detection parameters.  157
Chapter 1
Introduction
Increasing the number of cores has replaced clock frequency scaling as the method to
improve performance in state-of-the-art multicore processors. These multiple cores
can either be used in parallel by multiple applications or by multiple threads of
the same application to complete work faster. Maintaining good multicore scalability
while ensuring good single-core performance is of the utmost importance in continuing
to improve performance and energy efficiency.
1.1 Performance and Energy Efficiency
In the era of multicores, programmers need to invest more effort in designing software capable of exploiting multicore parallelism. To reduce memory access latency
and power consumption, a programmer can manually orchestrate communication and
computation or adopt the familiar programming paradigm of shared memory. But
will current shared memory architectures scale to many cores? This thesis addresses
the question of how to enable low-latency, low-energy memory access while retaining
the convenience of shared memory.
Current semiconductor trends project the advent of single-chip multicores dealing
with data at unprecedented scale and complexity. Memory scalability is critically constrained by off-chip bandwidth and on-chip latency and energy consumption [17]. For
multicores dealing with massive data, memory access latency and energy consumption
are now first-order design constraints.

1.1.1 Cache Management
A large, monolithic physically shared on-chip cache does not scale beyond a small
number of cores, and the only practical option is to physically distribute memory in
pieces so that every core is near some portion of the cache [13]. In theory this provides
a large amount of aggregate cache capacity and fast private memory for each core.
Unfortunately, it is difficult to manage distributed private caches effectively as they
require architectural support for cache coherence and consistency.
Popular directory-based protocols enable fast local caching to exploit data locality, but scale poorly with increasing core counts [6, 66]. Many recent proposals
(e.g., Tagless Directory [96], SPACE [99], SPATL [100], SCD [80], In-Network Cache
Coherence [29]) have addressed directory scalability in single-chip multicores using
complex sharer compression techniques or on-chip network capabilities.
Caches in state-of-the-art multicore processors are typically organized into multiple levels to take advantage of the multiple working sets in an application. The lower
level(s) of the cache hierarchy (i.e., closer to the compute pipeline) have traditionally
been private to each core while the highest level, a.k.a. the last-level cache (LLC), has
been diverse across production processors [13, 38]. Different LLC organizations offer
trade-offs between on-chip data locality and off-chip miss rate.
While private LLC organizations (e.g., AMD Opteron [26], Knights Landing [4])
have low hit latencies, their off-chip miss rates are high in applications that exhibit
high degrees of data sharing (due to cache line replication). In addition, private LLC
organizations are inefficient when running multiple applications that have uneven
distributions of working sets.
Shared LLC organizations (e.g., [2]), on the other hand, lead to non-uniform cache
access latencies (NUCA) [50, 42] that hurt on-chip locality. However, they improve
cache utilization and thereby, off-chip miss rates since cache lines are not replicated.
Exploiting spatio-temporal locality in such organizations is challenging since cache
latency is sensitive to data placement. To address this problem, coarse-grain data
placement, migration and restrictive replication schemes have been proposed (e.g.,
Victim Replication [97], Adaptive Selective Replication [10], Reactive-NUCA [39],
CMP-NuRAPID [24]). These proposals attempt to combine the good characteristics
of private and shared LLC organizations.
All previous proposals assume private low-level caches and use snoopy/directory-based cache coherence to keep the private caches coherent. A request for data allocates
and replicates a data block in the private cache hierarchy even if the data has no
spatial or temporal locality. This leads to cache pollution since such low locality data
can displace more frequently used data. Since on-chip wires are not scaling at the
same rate as transistors [16], unnecessary data movement and replication not only
impact latency, but also consume extra network and cache power [66].
In addition, previous proposals for hybrid last-level cache (LLC) management
either do not perform fine-grained adaptation of their policies to dynamic program
behavior or replicate cache lines based on static policies without paying attention to
their locality. In addition, some of them significantly complicate coherence (e.g., [24])
and do not scale to large core counts.
1.1.2 Single-core Performance
In addition to maintaining multicore scalability, good single-core performance must
be ensured to optimally utilize the functional units on each core and thereby extract the best performance.
Since the working sets of applications rarely fit within
the L1 cache, ensuring good single-core performance requires the exploitation of the
memory level parallelism (MLP) in an application.
MLP can be exploited using
out-of-order (OOO) processors by dynamically scheduling independent memory operations. With in-order processors, prefetching (both software/hardware) and
static compiler scheduling techniques can be used to exploit MLP. However, it is worth
noting that current industry trends are moving towards out-of-order processors in em-
bedded (Atom [72], ARM [36]), server (Xeon [38], SPARC [1]) and high-performance
computing processors (Knights Landing [4]).
1.2 Programmability and Memory Models
In addition to maintaining good performance and energy efficiency, future multicore
processors have to maintain ease of programming. The programming complexity is
significantly affected by the memory consistency model of the processor. The memory
model dictates the order in which the memory operations of one thread appear to
another. The strongest memory model is the Sequential Consistency (SC) [59] model.
SC mandates that the global memory order is an interleaving of the memory accesses
of each thread with each thread's memory accesses appearing in program order in
this global order. SC is the most intuitive model to the software developer and is the
easiest to program and debug with.
Production processors do not implement SC due to its negative performance impact. SPARC RMO [1], ARM [36] and IBM Power [65] processors implement relaxed
(i.e., weaker) memory models that allow reordering of load & store instructions with explicit fences for ordering when needed. These processors can better exploit memory
level parallelism (MLP), but require careful programmer-directed insertion of memory fences to do so. Intel x86 [83] and SPARC [1] processors implement Total Store
Order (TSO), which attempts to strike a balance between programmability and performance. The TSO model only relaxes the Store→Load ordering of SC, and improves
performance by enabling loads (that are crucial to performance) to bypass stores in
the write buffer.
Any new optimizations implemented for improving energy and performance must
be compatible with the memory consistency model of the processor and all existing
optimizations. Otherwise, the existing code can no longer be supported and programmers should undergo constant re-training to stay updated with the latest changes to
the memory model. An alternate solution is to off-load the responsibility of adapting
to memory model changes to the compiler while providing a constant model to the
programmer. However, this requires automated insertion of fences which is still an
active area of research [7].
1.3 Thesis Contributions
This thesis makes the following five principal contributions that holistically address
performance, energy efficiency and programmability.
1. Proposes a scalable limited directory-based coherence protocol, ACKwise [56,
57] that reduces the directory storage needed to track the sharers of a data
block.
2. Proposes a Locality-aware Private Cache Replication scheme [54] to better manage the private caches in multicore processors by intelligently controlling data
caching and replication.
3. Proposes a Timestamp-based Memory Ordering Validation technique that enables the preservation of familiar memory consistency models when the intelligent private cache replication scheme is applied to state-of-the-art production
processors.
4. Proposes a Locality-aware LLC Replication scheme [53] that better manages the
last-level shared cache (LLC) in multicore processors by balancing shared data
(and instruction) locality and off-chip miss rate through controlled replication.
5. Proposes a Locality-Aware Adaptive Cache Hierarchy Management Scheme that
seamlessly combines all the above schemes to provide optimal data locality and
miss rates at all levels of the cache hierarchy.
These five contributions are briefly summarized below.
1.3.1 ACKwise Directory Coherence Protocol
This thesis proposes the ACKwise limited-directory based coherence protocol. ACKwise, by using a limited set of hardware pointers to track the sharers of a cache line,
incurs much less area and energy overhead than the conventional directory-based protocol. If the number of sharers exceeds the number of hardware pointers, ACKwise
does not track the identities of the sharers anymore. Instead, it tracks the number
of sharers. On an exclusive request, an invalidation request is broadcast to all the
cores. However, acknowledgements need to be sent by only the actual sharers since
the number of sharers is tracked. The invalidation broadcast is handled efficiently
by making simple changes to an electrical mesh network. The ACKwise protocol is
advantageous because it achieves the performance of the full-map directory protocol
while reducing the area and energy consumption dramatically.
1.3.2 Locality-aware Private Cache Replication [54]
This thesis proposes a scalable, efficient protocol that better manages private caches
by enabling seamless adaptation between private and logically shared caching at the
fine granularity of cache lines. When a core makes a memory request that misses the
private caches, the protocol either brings the entire cache line using a directory-based
coherence protocol, or just accesses the requested word at the shared cache location
using a roundtrip message over the network. The second style of data management
is called "remote access". The decision is made based on the spatio-temporallocality
of a particular data block. The locality of each cache line is profiled at runtime by
measuring its reuse, i.e., the number of times the cache line is accessed before being
removed from its private cache location. Only data blocks with high spatio-temporal
locality (i.e., high reuse) are allowed to be privately cached.
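To make the reuse-based decision concrete, the following is a minimal software sketch of the classification step, assuming a simple per-core utilization counter and the Private Caching Threshold (PCT) described above; the actual hardware tables and state transitions are specified in Chapter 4.

```cpp
// Illustrative sketch only: per-core, per-cache-line reuse tracking with a
// Private Caching Threshold (PCT). Field sizes and names are assumptions.
#include <cstdint>

struct LocalityState {
    bool     private_mode = true;   // lines start out privately cacheable
    uint32_t utilization  = 0;      // accesses since the line was last brought in
};

// Called when this core's copy is evicted or invalidated; decides the mode
// used for the core's next request to this cache line.
void reclassify(LocalityState& s, uint32_t pct) {
    // High reuse (>= PCT): keep handing out private copies.
    // Low reuse (< PCT): switch to remote-access (word-level) mode.
    s.private_mode = (s.utilization >= pct);
    s.utilization  = 0;             // start counting afresh for the next episode
}

void on_access(LocalityState& s) { s.utilization++; }
```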
A low-overhead, highly accurate hardware predictor that tracks the locality of
cache lines is proposed. The predictor takes advantage of the correlation exhibited
between the reuse of multiple cores for a cache line. Only the locality information
for a few cores per cache line is maintained and the locality information for others is
predicted by taking a majority vote. This locality tracking mechanism is decoupled
from the sharer tracking structures. Hence, this protocol can work with ACKwise
or any other directory coherence protocol that enables scalable tracking of sharers.
However, it is worth noting that the locality-aware protocol makes ACKwise even
more efficient since it reduces the number of privately-cached copies of a data block.
Locality-aware Private Cache Replication is advantageous because it:
1. Better exploits on-chip private cache capacity by intelligently controlling data
caching and replication.
2. Lowers memory access latency by trading off unnecessary cache evictions or
expensive invalidations with much cheaper word accesses.
3. Lowers energy consumption by better utilizing on-chip network and cache resources.
1.3.3 Timestamp-based Memory Ordering Validation
This thesis explores how to extend the above locality-aware private cache replication
scheme for processors with out-of-order memory (i.e., supporting multiple outstanding memory transactions using non-blocking caches) that employ popular memory
models. The remote access required by the locality-aware coherence protocol is incompatible with the following two optimizations implemented in such processors to
exploit memory level parallelism (MLP). (1) Speculative out-of-order execution is
used to improve load performance, enabling loads to be issued and completed before
previous load/fence operations. Memory consistency violations are detected when invalidations, updates or evictions are made to addresses in the load buffer. The pipeline
state is rolled back if this situation arises. (2) Exclusive store prefetch
[33] requests
are used to improve store performance. These prefetch requests fetch the cache line
into the L1-D cache and can be executed out-of-order and in parallel. However, the
store requests must be issued and completed following the ordering constraints specified by the memory model. But, most store requests hit in the L1-D cache (due to
the earlier prefetch request) and hence can be completed quickly.
Remote access is incompatible due to the following two reasons. (1) A remote load
does not create private cache copies of cache lines and hence, invalidation/ update
requests cannot be used to detect memory consistency violations for speculatively
executed operations. (2) A remote store also never creates private cache copies and
hence, exclusive store prefetch requests cannot be employed to improve performance.
In this thesis, a novel technique is proposed that uses timestamps to detect memory
consistency violations when speculatively executing loads under the locality-aware
protocol. Each load and store operation is assigned an associated timestamp and a
simple arithmetic check is done at commit time to ensure that memory consistency has
not been violated. This technique does not rely on invalidation/update requests and
hence is applicable to remote accesses. The timestamp mechanism is efficient due to
the observation that consistency violations occur due to conflicting accesses that have
temporal proximity (i.e., within a few cycles of each other), thus requiring timestamps
to be stored only for a small time window.
This technique works completely in
hardware and requires only 2.5 KB of storage per core.
This scheme guarantees
forward progress and is starvation-free. An implementation of the technique for the
Total Store Order (TSO) model is provided and adaptations to implement it on other
memory models are discussed.
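As a rough illustration of the commit-time arithmetic check, the sketch below compares the logical timestamp at which a load speculatively executed against the timestamp of the most recent conflicting store; the per-address table shown here is a hypothetical stand-in for the bounded history queues developed in Chapter 5, not the thesis's exact algorithm.

```cpp
// Illustrative sketch, assuming a hypothetical per-address record of the most
// recent conflicting (remote) store timestamp.
#include <cstdint>
#include <unordered_map>

std::unordered_map<uint64_t, uint64_t> g_last_store_ts; // addr -> logical time of last conflicting store

struct SpeculativeLoad {
    uint64_t addr;
    uint64_t exec_ts;   // logical time at which the load speculatively executed
};

// Returns true if the load may commit; false triggers a pipeline rollback,
// mirroring the "simple arithmetic check at commit time" described above.
bool commit_check(const SpeculativeLoad& ld) {
    auto it = g_last_store_ts.find(ld.addr);
    if (it == g_last_store_ts.end()) return true;   // no conflicting store observed
    return it->second <= ld.exec_ts;                // violation if a newer store exists
}
```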
1.3.4 Locality-aware LLC Replication [53]
This thesis proposes a data replication mechanism for the LLC that retains the on-chip cache utilization of the shared LLC while intelligently replicating cache lines close
to the requesting cores so as to maximize data locality. The proposed scheme only
replicates those cache lines that demonstrate reuse at the LLC while bypassing the
replication overheads for the other cache lines. To achieve this goal, a low-overhead yet
highly accurate in-hardware locality classifier (14.5 KB storage per core) is proposed
that operates at the cache line granularity and only allows the replication of cache
lines with high spatio-temporal locality (similar to the earlier private cache replication
technique). This classifier captures the LLC pressure and adapts its replication decision accordingly. This data replication mechanism does not involve "remote access"
and hence, does not require special support for adhering to a memory consistency
model.
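A minimal sketch of the replication decision is shown below, assuming simple reuse counters and a replication threshold RT; the XReuse bookkeeping follows the description that accompanies Figure 6-5, and the exact hardware fields appear in Chapter 6.

```cpp
// Illustrative sketch of the reuse-threshold (RT) decision for LLC replication.
#include <cstdint>

struct ReplicaState {
    bool     replica_mode  = false;  // lines start in non-replica mode
    uint32_t home_reuse    = 0;      // hits served at the home LLC slice
    uint32_t replica_reuse = 0;      // hits served at the local replica (if any)
};

// Re-evaluated when the line is invalidated at or evicted from the replica.
// XReuse is (Replica + Home) reuse on an invalidation, Replica reuse on an eviction.
void reclassify_llc(ReplicaState& s, uint32_t rt, bool invalidation) {
    uint32_t xreuse = invalidation ? (s.replica_reuse + s.home_reuse)
                                   : s.replica_reuse;
    s.replica_mode  = (xreuse >= rt);   // replicate only lines that show reuse
    s.home_reuse    = 0;
    s.replica_reuse = 0;
}
```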
Locality-aware LLC Replication is advantageous because it:
1. Lowers memory access latency and energy by selectively replicating cache lines
that show high reuse in the LLC slice of the requesting core.
2. Better exploits the LLC by balancing the off-chip miss rate and on-chip locality
using a classifier that adapts to the run-time reuse at the granularity of cache
lines.
3. Allows coherence complexity almost identical to that of a traditional non-hierarchical (flat) coherence protocol since replicas are only allowed to be placed
at the LLC slice of the requesting core. The additional coherence complexity
only arises within a core when the LLC slice is searched on a private cache miss,
or when a cache line in the core's local cache hierarchy is evicted/invalidated.
1.3.5 Locality-aware Cache Hierarchy Replication
This thesis combines the private cache replication and LLC replication schemes discussed in Chapters 4 & 6 with the timestamp-based memory ordering validation technique presented in Chapter 5 into a combined cache hierarchy replication scheme. The
design of this scheme is motivated by the experimental observation that both locality-aware private cache & LLC replication enable varying performance improvements for
benchmarks. Certain benchmarks exhibit improvement only with intelligent private
cache replication, certain others exhibit improvement only with locality-aware LLC
replication, while certain benchmarks exhibit improvement with both locality-aware
private cache & LLC replication. This necessitates the design of a combined replication scheme that exploits the benefits of both the schemes introduced earlier.
1.3.6 Results Summary
On a 64-core multicore processor with out-of-order cores, Locality-aware Cache Hierarchy Replication improves completion time by 15% and energy by 22% while incurring a storage overhead of 30.5 KB per core (i.e., 10% the aggregate cache capacity
of each core). Locality-aware Private Cache Replication alone improves completion
time by 13% and energy by 15% while incurring a storage overhead of 21.5 KB per
core. Locality-aware LLC Replication improves completion time by 10% and energy
by 15% while incurring a storage overhead of 14.5 KB per core. Note that in the
evaluated system, the L1 cache is the only private cache level and the L2 cache is the
last-level cache (LLC).
1.4 Organization of Thesis
The rest of this thesis is organized as follows.
Chapter 2 describes the evaluation
methodology, including the baseline system. Chapter 3 describes the ACKwise directory coherence protocol. Chapter 4 describes the locality-aware private cache replication scheme. Chapter 5 describes the memory consistency implications of the locality-aware scheme. It proposes a novel timestamp-based technique to enable the implementation of the locality-aware scheme in state-of-the-art processors while preserving
their memory consistency model.
Chapter 6 describes the locality-aware adaptive
LLC replication scheme. Chapter 7 combines the schemes developed in Chapters 4 &
6 along with the mechanism to detect memory consistency violations in Chapter 5 to
propose a locality-aware cache hierarchy management scheme. Chapter 8 describes
the related work in detail and compares and contrasts that against the work described
in this thesis. Finally, Chapter 9 concludes the thesis.
Chapter 2

Evaluation Methodology

2.1 Baseline Architecture

Figure 2-1: Architecture of the baseline system. Each core consists of a compute pipeline, private L1 instruction and data caches, a physically distributed shared L2 cache with integrated directory and a network router.
The baseline system is a tiled multicore with an electrical 2-D mesh interconnection network as shown in Figure 2-1. Each core consists of a compute pipeline, private
L1 instruction and data caches, a physically distributed shared L2 cache with integrated directory, and a network router. The coherence directory is integrated with
the L2 slices by extending the L2 tag arrays (in-cache directory organization [18, 13])
and tracks the sharing status of the cache lines in the per-core private L1 caches. The
private caches are kept coherent using a full-map directory-based coherence protocol.
Some cores along with the periphery of the electrical mesh have a connection to a
memory controller as well. The mesh network uses dimension-order X-Y routing and
wormhole flow control.
The shared L2 cache is managed using the data placement, replication and migration mechanisms of Reactive-NUCA [39] as follows. Private data is placed at the
L2 slice of the requesting core, shared data is interleaved at the OS page granularity
across all L2 slices, and instructions are replicated at a single L2 slice for every cluster of 4 cores using a rotational interleaving mechanism. Classification of data into
private and shared is done at the OS page granularity by augmenting the existing
page table and TLB (Translation Lookaside Buffer) mechanisms.
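A simplified sketch of this placement policy is shown below, assuming a 64-core machine and 4 KB pages (both illustrative constants); instruction replication with rotational interleaving is omitted for brevity.

```cpp
// Illustrative address-to-L2-slice mapping for the R-NUCA-style baseline.
#include <cstdint>

constexpr uint64_t kPageSize = 4096;   // assumption: 4 KB OS pages
constexpr int      kNumCores = 64;     // assumption: 64-core configuration

int home_slice(uint64_t paddr, bool page_is_private, int requester_core) {
    if (page_is_private)
        return requester_core;                                // private data: local L2 slice
    return static_cast<int>((paddr / kPageSize) % kNumCores); // shared data: page-interleaved
}
```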
We evaluate both 64-core and 256-core multicore processors. The important architectural parameters used for evaluation are shown in Table 2.1. Both in-order and
out-of-order cores are evaluated (the parameters for each are shown in Table 2.1).
2.2 Performance Models
All experiments are performed using the core, cache hierarchy, coherence protocol,
memory system and on-chip interconnection network models implemented within the
Graphite [68] multicore simulator. The Graphite simulator requires the memory system (including the cache hierarchy) to be functionally correct to complete simulation.
This is a good test that all our cache coherence protocols are working correctly given
that we have run 27 benchmarks to completion. The Nehalem decoder for the out-of-order core is borrowed from the ZSim simulator [81].
The electrical mesh interconnection network uses X-Y routing. Since modern
network-on-chip routers are pipelined [27], and 2- or even 1-cycle per hop router
latencies [52] have been demonstrated, we model a 2-cycle per hop delay; we also
account for the appropriate pipeline latencies associated with loading and unloading
a packet onto the network. In addition to the fixed per-hop latency, network link
contention delays are also modeled.
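For reference, a back-of-the-envelope model of the zero-load network traversal latency under these assumptions (X-Y routing, 2 cycles per hop) is sketched below; the injection/ejection and serialization terms are illustrative placeholders, and contention is modeled separately inside Graphite.

```cpp
// Illustrative zero-load latency model for the 2-D mesh (not Graphite's model).
#include <cstdlib>

int mesh_hops(int src_x, int src_y, int dst_x, int dst_y) {
    return std::abs(dst_x - src_x) + std::abs(dst_y - src_y);  // X-Y dimension order
}

int network_latency_cycles(int hops, int flits,
                           int per_hop = 2, int inject = 1, int eject = 1) {
    // per-hop pipeline delay + load/unload + serialization of the packet body
    return inject + hops * per_hop + eject + (flits - 1);
}
```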
Number of Cores: 64 & 256
Clock Frequency: 1 GHz
Processor Word Size: 64 bits
Physical Address Length: 48 bits

In-Order Core
  Issue Width: 1

Out-of-Order Core
  Issue Width: 1
  Reorder Buffer Size: 168
  Load Queue Size: 64
  Store Queue Size: 48

Memory Subsystem
  L1-I Cache per core: 16 KB, 4-way Assoc., 1 cycle
  L1-D Cache per core: 32 KB, 4-way Assoc., 1 cycle
  L2 Cache (LLC) per core: 256 KB, 8-way Assoc., 7 cycle (2 cycle tag, 4 cycle data), Inclusive, R-NUCA [39]
  Cache Line Size: 64 bytes
  Directory Protocol: Invalidation-based MESI
  Num. of Memory Controllers: 8
  DRAM Bandwidth: 5 GBps per Controller
  DRAM Latency: 75 ns

Electrical 2-D Mesh with XY Routing
  Hop Latency: 2 cycles (1-router, 1-link)
  Flit Width: 64 bits
  Header (Src, Dest, Addr, MsgType): 1 flit
  Word Length: 1 flit (64 bits)
  Cache Line Length: 8 flits (512 bits)

Table 2.1: Architectural parameters used for evaluation
2.3 Energy Models
For energy evaluations of on-chip electrical network routers and links, we use the
DSENT [87] tool. Energy estimates for the L1-I, L1-D and L2 (with integrated
directory) caches as well as DRAM are obtained using McPAT [60].
When calculating the energy consumption of the L2 cache, we assume a word
addressable cache architecture. We model the dynamic energy consumption of both
the word access and the cache line access in the L2 cache using McPAT.
The energy evaluation is performed at the 11 nm technology node to account for
future scaling trends. We derive models for a trigate 11 nm electrical technology node
using the virtual-source transport models of [49] and the parasitic capacitance model
of [90].
These models are used to obtain electrical technology parameters (Table
2.2) used by both McPAT and DSENT. As clock frequencies are relatively slow, high
threshold (HVT) transistors are assumed for lower leakage.
Process Supply Voltage (VDD): 0.6 V
Gate Length: 14 nm
Contacted Gate Pitch: 44 nm
Gate Cap / Width: 2.420 fF/μm
Drain Cap / Width: 1.150 fF/μm
Effective On Current / Width (N/P): 739/668 μA/μm
Off Current / Width: 1 nA/μm

Table 2.2: Projected Transistor Parameters for 11 nm Tri-Gate
2.4 Toolflow
The overall toolflow is as follows. Graphite runs a benchmark for the chosen cache configuration and cache coherence protocol, producing event counters and performance
results. The specified cache and network configurations are also fed into McPAT and
DSENT to obtain dynamic per-event energies for each component. Event counters
and completion time output from Graphite are then combined with per-event energies
to obtain the overall dynamic energy usage of the benchmark.
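The final step of the toolflow is essentially a dot product of event counts and per-event energies; a minimal sketch is shown below, with illustrative container types and names.

```cpp
// Illustrative energy accounting: Graphite event counters combined with
// per-event energies from McPAT/DSENT (names and units are assumptions).
#include <cstdint>
#include <map>
#include <string>

double dynamic_energy_joules(const std::map<std::string, uint64_t>& event_counts,
                             const std::map<std::string, double>& energy_per_event) {
    double total = 0.0;
    for (const auto& [event, count] : event_counts) {
        auto it = energy_per_event.find(event);
        if (it != energy_per_event.end())
            total += count * it->second;   // e.g., L1-D reads x energy per read
    }
    return total;
}
```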
2.5 Application Benchmarks
We simulate SPLASH-2 [93] benchmarks, PARSEC [14] benchmarks, Parallel-MI-Bench [43], a Travelling-Salesman-Problem (TSP) benchmark, a Matrix-Multiply (MATMUL) benchmark, a Depth-First-Search (DFS) benchmark, and two graph benchmarks (CONNECTED-COMPONENTS & COMMUNITY-DETECTION) [3] using the Graphite multicore simulator. The graph benchmarks model social networking based applications. The problem sizes for each application are shown in Table 2.3.

SPLASH-2 [93]
  RADIX: 4M integers, radix 1024
  LU-C, LU-NC: 1024 x 1024 matrix, 16 x 16 blocks
  OCEAN-C: 2050 x 2050 ocean
  OCEAN-NC: 1026 x 1026 ocean
  CHOLESKY: tk29.0
  BARNES: 64K particles
  OCEAN: 258 x 258 ocean
  WATER-NSQUARED: 512 molecules
  WATER-SPATIAL: 512 molecules
  RAYTRACE: car
  VOLREND: head

PARSEC [14]
  BLACKSCHOLES: 64K options
  SWAPTIONS: 64 swaptions, 20,000 sims.
  STREAMCLUSTER: 16384 points per block, 1 block
  DEDUP: 31 MB data
  FERRET: 256 queries, 34,973 images
  BODYTRACK: 4 frames, 4000 particles
  FLUIDANIMATE: 5 frames, 100,000 particles
  CANNEAL: 200,000 elements
  FACESIM: 1 frame, 372,126 tetrahedrons

Parallel MI Bench [43]
  DIJKSTRA-SINGLE-SOURCE: Graph with 4096 nodes
  DIJKSTRA-ALL-PAIRS: Graph with 512 nodes
  PATRICIA: 5000 IP address queries
  SUSAN: PGM picture 2.8 MB

UHPC [3]
  CONNECTED-COMPONENTS: Graph with 2^18 nodes
  COMMUNITY-DETECTION: Graph with 2^16 nodes

Others
  TSP: 16 cities
  MATRIX-MULTIPLY: 1024 x 1024 matrix
  DFS: Graph with 876800 nodes

Table 2.3: Problem sizes for our parallel benchmarks.
Chapter 3
ACKwise Directory Coherence
Protocol
This chapter presents ACKwise, a directory coherence protocol derived from an MSI directory-based protocol. Each directory entry in this protocol, as shown in Figure 3-1,
is similar to one used in a limited directory organization [6] and contains the following
3 fields: (1) State: This field specifies the state of the cached block (one of the MSI
states); (2) Global(G): This field states whether the number of sharers for this data
block exceeds the capacity of the sharer list. If so, a broadcast is needed to invalidate
all the cached blocks corresponding to this address when a cache demands exclusive
ownership; (3) Sharers_1..p: This field represents the sharer list. It can track up to p distinct core IDs.
Figure 3-1: Structure of an ACKwisep coherence directory entry
ACKwise operates similar to a full-map directory protocol when the number of
sharers is less than or equal to the number of hardware pointers (p) in the limited
directory. When the number of sharers exceeds the number of hardware pointers (p),
the Global(G) bit is set to true so that any number of sharers beyond this point can
be accommodated. Once the global (G) bit is set to true, the sharer list (Sharers_1..p) just holds the total number of sharers of this data block.
3.1 Protocol Operation
When a request for a shared copy of a data block is issued, the directory controller
first checks the state of the data block in the directory cache.
(a) If the state is
Invalid(I), it forwards the request to the memory controller. The memory controller
fetches the data block from memory and sends it directly to the requester.
It also
sends an acknowledgement to the directory. The directory changes the state of the
data block to Shared(S). (b) If the state is Shared(S), the data is fetched from the
L2 cache and forwarded to the requester. (c) If the state is Modified(M), the data is
fetched from the Li cache of the owner and forwarded to the requester and the state
is set to Shared(S). In all the above cases, the directory controller also tries to add
the ID of the requester to the sharer list. This is straightforward if the global(G) bit
is clear and the sharer list has vacant spots. If global(G) bit is clear but the sharer list
is full, it sets the global(G) bit to true and stores the total number of sharers in the
sharer list. If the global(G) bit is already set to true, then it increments the number
of sharers by one.
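The sharer-list update logic described above can be summarized in the short sketch below. The class and field names are illustrative choices for this sketch rather than the hardware structures themselves; it captures only the three cases: a vacant hardware pointer, an overflow that sets the global (G) bit, and an already-set global (G) bit.

```python
class AckwiseEntry:
    """Illustrative ACKwisep directory entry: p hardware pointers plus a global (G) bit."""

    def __init__(self, p):
        self.p = p                # number of hardware pointers
        self.global_bit = False   # G bit
        self.sharers = []         # core IDs, valid only while the G bit is clear
        self.num_sharers = 0      # sharer count, used once the G bit is set

    def add_sharer(self, core_id):
        if self.global_bit:
            self.num_sharers += 1                 # only the count is maintained
        elif core_id in self.sharers:
            return                                # requester already tracked
        elif len(self.sharers) < self.p:
            self.sharers.append(core_id)          # vacant hardware pointer
        else:
            # Overflow: remember only the total number of sharers from now on.
            self.num_sharers = len(self.sharers) + 1
            self.sharers = []
            self.global_bit = True
```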
When a request for an exclusive copy of a data block is issued, the directory
controller first checks the state of the data block in the directory cache. (a) If the
state is Invalid(I), the sequence of actions followed is the same as that for the read
request except that the state of the data block in the directory is set to Modified(M)
instead of Shared(S). (b) If the state is Shared(S), then the actions performed by the
directory controller is dependent on the state of global (G) bit. If the global (G) bit
is clear, it unicasts invalidation messages to each core in the sharer list. Else, if the
global (G) bit is set, it broadcasts an invalidation message.
The sharers invalidate
their cache blocks and acknowledge the directory. The directory controller expects
as many acknowledgements as the number of sharers (encoded in the sharer list if
the global(G) bit is set and calculated directly if the global(G) bit is clear). After all
38
the acknowledgements are received, the directory controller sets the state of the data
block to Modified(M) and the global(G) bit to false. (c) If the state is Modified(M),
the directory flushes the owner and forwards the data to the requester. In all cases,
the requester is added to the sharer list as well.
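Similarly, the invalidation step on an exclusive request reduces to the sketch below, which reuses the illustrative AckwiseEntry above; the two send_* callbacks stand in for the on-chip network interface and are not part of the thesis implementation.

```python
def handle_exclusive_request(entry, send_unicast_inv, send_broadcast_inv):
    """Invalidate the current sharers and return the number of acks to wait for."""
    if entry.global_bit:
        expected_acks = entry.num_sharers   # exact count is known even though IDs are not
        send_broadcast_inv()                # so a single invalidation broadcast suffices
    else:
        expected_acks = len(entry.sharers)
        for core in entry.sharers:
            send_unicast_inv(core)          # unicast to each tracked sharer
    return expected_acks
```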
3.2 Silent Evictions
Silent evictions of cache lines cannot be supported by this protocol since the exact
number of sharers has to be always maintained.
However, since evictions are not
on the critical path, they do not hurt performance other than by increasing network
contention delays negligibly. Since evictions do not contain data, the network energy
overhead is also very small. In a 1024-core processor, performance is reduced by only 1% and the total number of network flits sent increases by only 3% relative to a protocol that supports silent evictions.
Figure 3-2: Broadcast routing on a mesh network from core B. X-Y dimension-order
routing is followed.
3.3 Electrical Mesh Support for Broadcast
The mesh has to be augmented with native hardware support for broadcasts. The
broadcast is carried out by configuring each router to selectively replicate a broadcasted message on its output links. Dimension-order X-Y routing is followed. Figure 3-2 illustrates how a broadcast is carried out starting from core B.

Figure 3-3: Illustration of deadlock when using broadcasts on a 1-dimensional mesh network with X-Y routing.
Even though unicasts with dimension-order X-Y routing do not cause network
deadlocks, broadcasts with the same routing scheme lead to deadlocks (assuming no
virtual channels).
This can be illustrated using the example in Figure 3-3.
There
are four broadcast packets and two routers. Packet-1 has three flits (1H, 1B, 1T), packet-2 has three flits (2H, 2B, 2T), packet-3 has two flits (3H, 3T) and packet-4 has two flits (4H, 4T). Assume that the mesh is 1-dimensional for simplicity. Packet-1 wants to move to Router-B and is stuck behind Packet-4. Packet-4 wants to move to Core-B (i.e., the core attached to Router-B) and is waiting for Packet-2 to finish transmitting. Packet-2 wants to move to Router-A and is stuck behind Packet-3. And Packet-3 wants to move to Core-A and is waiting for Packet-1 to finish transmitting,
thereby completing the circular dependency.
Virtual channels can be used to avoid these deadlocks (e.g., virtual circuit multicasting [47]). Without virtual channels, the broadcast packets have to be handled using virtual cut-through flow control to ensure forward progress. With virtual cut-through, Packet-1 would completely reside in Router-A and Packet-2 in Router-B,
thereby removing the circular dependency.
Unicast packets can still use wormhole
flow control. Using two flow control techniques in the same network, one for broadcast and another for unicast, increases complexity.
Also, virtual cut-through flow
control places restrictions on the number of flits in each input port flit-buffer.
To
avoid such complications, the size of each broadcast packet is restricted to 1 flit. This
restriction is entirely reasonable since with a 64-bit flit, 48 bits can be allocated for
the physical address and 10 bits for the sender core ID (assuming a 1024-core processor). The remaining 6 bits suffice for storing the invalidation message type. This
restriction allows wormhole flow control to be used for all packets. The deadlock and
its avoidance mechanism have been simulated using a cycle-level version of Graphite
employing finite buffer models in the network and were found to operate as described
above.
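As an illustration of this bit budget, the sketch below packs an invalidation broadcast into a single 64-bit flit. The field ordering is an assumption made for this sketch; the thesis only fixes the field widths (48-bit physical address, 10-bit sender core ID, 6-bit message type) and the 64-bit flit size.

```python
ADDR_BITS, CORE_BITS, TYPE_BITS = 48, 10, 6
assert ADDR_BITS + CORE_BITS + TYPE_BITS == 64        # fits in a single 64-bit flit

def pack_broadcast_flit(addr, sender, msg_type):
    """Pack an invalidation broadcast into one 64-bit flit (illustrative field layout)."""
    assert addr < (1 << ADDR_BITS) and sender < (1 << CORE_BITS) and msg_type < (1 << TYPE_BITS)
    return (addr << (CORE_BITS + TYPE_BITS)) | (sender << TYPE_BITS) | msg_type

def unpack_broadcast_flit(flit):
    msg_type = flit & ((1 << TYPE_BITS) - 1)
    sender = (flit >> TYPE_BITS) & ((1 << CORE_BITS) - 1)
    addr = flit >> (CORE_BITS + TYPE_BITS)
    return addr, sender, msg_type

flit = pack_broadcast_flit(addr=0x123456789A40, sender=771, msg_type=3)
assert unpack_broadcast_flit(flit) == (0x123456789A40, 771, 3)
```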
3.4 Evaluation Methodology
We evaluate a 256-core shared memory multicore using out-of-order cores. The default architectural parameters used for evaluation are shown in Table 2.1. The mesh
network is equipped with broadcast support as described unless otherwise mentioned.
3.5 Results
In this section, the performance and energy efficiency of ACKwise is evaluated. First,
the sensitivity of ACKwise to the number of hardware pointers k is evaluated. Next,
its sensitivity to broadcast support is evaluated. And finally, ACKwisek is compared
to an alternate limited directory-based protocol DirkNB [6].

3.5.1 Sensitivity to Number of Hardware Pointers (k)
Here, the sensitivity of the ACKwisek protocol to the number of hardware pointers
k, is evaluated.
The value k is varied from 1 to 256 in the following order: 1, 2,
4, 8, 16 & 256. The value 256 corresponds to the full-map directory organization,
which allocates a bit to track the sharing status of every core in the system. The
other values of k correspond to ACKwise with different numbers of hardware pointers.
Each hardware pointer is 8 bits long (since log2(256) = 8). So, with k = 1, 2, 4, 8,
16 & 256, the number of bits per cache line allocated to sharer tracking is 8, 16, 32,
64, 128 & 256 respectively.

Figure 3-4: Completion time of the ACKwisek protocol when k is varied as 1, 2, 4, 8, 16 & 256. Results are normalized to the full-map protocol (k = 256).

Figure 3-4 plots the completion time of the ACKwisek protocol as k is varied. Completion time is broken down into seven categories: Instructions (time spent executing instructions), Branch Speculation (time spent on mis-predicted branch instructions), L1-I Fetch Stalls (stall time due to instruction cache misses), Compute Stalls (stall time waiting for functional unit results, e.g., ALU and FPU), Memory Stalls (stall time due to memory accesses), Synchronization (time spent waiting on synchronization primitives), and Idle (time spent waiting for a thread to be spawned). The completion times of all the ACKwisek variants closely track that of the full-map protocol, indicating that ACKwisek handles the invalidation broadcast requests created as a result of the limited directory size well. The number of
acknowledgement responses sent by ACKwisek on the other hand, is independent of
the value of k and is equal to that sent by the full-map directory (since ACKwisek
tracks the exact number of sharers). Overall, the ACKwise 4 protocol is better than
the other variants by ~ 0.5%.
Figure 3-5: Energy of the ACKwisek protocol when k is varied as 1, 2, 4, 8, 16 & 256. Results are normalized to the full-map protocol (k = 256). Energy is broken down into L1-I cache, L1-D cache, L2 cache, directory, network router, network link and DRAM components.
Figure 3-5 plots the energy as a function of k. Initially, energy is found to reduce
as k is increased. This is due to the reduction in network traffic obtained when the
number of broadcasts is reduced. However, as k increases further, the directory size
increases to track more sharers. This increases the overall energy consumption. The
full-map protocol exhibits the largest directory size, and hence the highest directory
component of energy. Overall, the ACKwise 2 , ACKwise 4 and ACKwise8 protocols
show a 3.5% energy savings over Full-Map, while the ACKwise
protocol only shows
a 1.5% improvement. Based on the performance & energy results, ACKwise4 (with 4
hardware pointers) is chosen as the optimal protocol going ahead even though other
variants of ACKwise show similar trends.
The ACKwise4 protocol also reduces the area overhead for tracking sharers from
128 KB per core for the full-map directory to a mere 16 KB.
3.5.2 Sensitivity to Broadcast Support
Here, the sensitivity of the ACKwise protocol to broadcast support is evaluated.
This is done by comparing the performance and energy efficiency of ACKwisek on a
mesh without broadcast support to the previously evaluated scheme (with broadcast support). In the absence of specialized broadcast support, a broadcast is handled by sending separate unicast messages to all the cores in the system. These unicast messages have to be serialized through the network interface of the sender. This
increases their latency and consumes on-chip network bandwidth.
Figure 3-6: Completion time of the ACKwisek protocol when run on a mesh network without broadcast support.
In the figure, NB is used to indicate the
absence of broadcast support. The results are normalized to the ACKwise 4 protocol
on a broadcast-enabled mesh. The completion time is high at low values of k. This
is due to the large number of broadcasts that cause high invalidation latencies and
network traffic disruption. This is evidenced by the increase in memory stall time.
The synchronization time also increases if memory stalls occur in the critical regions.
As the value of k increases, the completion time reduces due to a reduction in the
number of broadcasts.
3.5.3 Comparison to DirkNB [6] Organization
Here, ACKwise is compared to an alternate limited-directory organization called
DirkNB. DirkNB only tracks the unique identities of k sharers like ACKwisek. However, if the number of sharers exceeds the number of hardware pointers k, an existing
sharer is invalidated to make space for the new sharer. This technique always ensures that all the sharers of a cache line can be accommodated within the hardware
pointers, thereby not requiring support for broadcasts. However, this organization
does not work well for cache lines that are widely shared with significant number of
read accesses since DirkNB can only accommodate a maximum of k readers before
invalidating other readers.
Figure 3-7: Completion time of the DirkNB protocol when k is varied as 2, 4, 8 & 16.
Results are normalized to the ACKwise 4 protocol.
Figure 3-7 plots the completion time of DirkNB as a function of k.
Results are normalized to the ACKwise 4 protocol.
The value k is varied from 2
to 16 in powers of 2. The DirkNB protocol performs poorly and exhibits high Li-I
(instruction) cache stall time due to the wide sharing of instructions between program
threads. Memory stalls are found to be significant in the
RADIX
benchmark as well
due to the presence of widely-shared read-mostly data. Completion time reduces as k
is increased since more sharers can be accommodated before the need for invalidation.
However, the ACKwise 4 protocol performs at least an order of magnitude better than
all DirkNB variants.
3.6 Summary
This chapter proposes the ACKwise limited-directory based coherence protocol. ACKwise, by using a limited set of hardware pointers to track the sharers of a cache line,
incurs much less area and energy overhead than the conventional directory-based protocol. If the number of sharers exceeds the number of hardware pointers, ACKwise
does not track the identities of the sharers anymore. Instead, it tracks the number of
sharers. On an exclusive request, an invalidation request is broadcast to all the cores.
However, acknowledgements need to be sent by only the actual sharers since the number of sharers is tracked. The invalidation broadcast is handled efficiently by making
simple changes to an electrical mesh network. Evaluations on a 256-core multicore
show that ACKwise 4 incurs only 16 KB storage per core compared to the 128 KB
storage required for the full-map directory. ACKwise 4 also matches the performance
of the full-map directory while expending 3.5% less energy.
Chapter 4

Locality-aware Private Cache Replication
4.1 Motivation

Figure 4-1: Invalidations vs Utilization (fraction of invalidated cache lines, bucketed by utilization: 1, 2-3, 4-5, 6-7, >= 8).
First, the need for a locality-aware allocation of cache lines in the private caches
of a shared memory multicore processor is motivated. The locality of each cache line
is quantified using a utilization metric. The utilization is defined as the number of
accesses that are made to the cache line by a core after being brought into its private
cache hierarchy and before being invalidated or evicted.

Figure 4-2: Evictions vs Utilization (fraction of evicted cache lines, bucketed by utilization: 1, 2-3, 4-5, 6-7, >= 8).
Figures 4-1 and 4-2 show the percentage of invalidated and evicted cache lines as
a function of their utilization. We observe that many cache lines that are evicted or
invalidated in the private caches exhibit low locality (e.g., in
STREAMCLUSTER,
80%
of the cache lines that are invalidated have utilization < 4). To avoid the performance
penalties of invalidations and evictions, we propose to only bring those cache lines that
have high spatio-temporal locality into the private caches and not replicate those with
low locality. This is accomplished by tracking vital statistics at the private caches and
the on-chip directory to quantify the utilization of data at the granularity of cache
lines. This utilization information is subsequently used to classify data as cache line
or word accessible.
4.2 Protocol Operation
We first define a few terms to facilitate describing our protocol.

• Private Sharer: A private sharer is a core which is granted a private copy of a cache line in its L1 cache.

• Remote Sharer: A remote sharer is a core which is NOT granted a private copy of a cache line. Instead, its L1 cache miss is handled at the shared L2 cache location using word access.

• Private Utilization: Private utilization is the number of times a cache line is used (read or written) by a core in its private L1 cache before it gets invalidated or evicted.

• Remote Utilization: Remote utilization is the number of times a cache line is used (read or written) by a core at the shared L2 cache before it is brought into its L1 cache or gets written to by another core.

• Private Caching Threshold (PCT): The private utilization above or equal to which a core is "promoted" to be a private sharer, and below which a core is "demoted" to be a remote sharer of a cache line.

Figure 4-3: ①, ② and ③ are mockup requests showing the two modes of accessing on-chip caches using our locality-aware protocol. Since the black data block has high locality with respect to the core issuing ①, the directory at the home-node hands out a private copy of the cache line. On the other hand, the low-locality red data block is always cached in a single location at its home-node, and all requests (②, ③) are serviced using roundtrip remote-word accesses.
Note that a cache line can have both private and remote sharers. We first describe
the basic operation of the protocol. Later, we present heuristics that are essential for
a cost-efficient hardware implementation. Our protocol starts out as a conventional
directory protocol and initializes all cores as private sharers of all cache lines (as
shown by Initial in Figure 4-4). Let us understand the handling of read and write
49
Initial
Utilization < PCT
Utilization >= PCT
Utilization < PCT
Utilization >= PCT
Figure 4-4: Each cache line is initialized to Private with respect to all sharers. Based on
the utilization counters that are updated on each memory access to this cache line and the
parameter PCT, the sharers are transitioned between Private and Remote modes. Here
utilization = (private + remote) utilization.
requests under this protocol.
4.2.1 Read Requests
When a core makes a read request and misses in its private Li cache, the request is
sent to the L2 cache. If the cache line is not present in the L2 cache, it is brought in from off-chip memory. The L2 cache then hands out a private read-only copy of the cache line if the core is marked as a private sharer in its integrated directory (① in Figure 4-3). (Note that when a cache line is brought in from off-chip memory, all
cores start out as private sharers). The core then tracks the locality of the cache line
by initializing a private utilization counter in its Li cache to 1 and incrementing this
counter for every subsequent read. Each cache line tag is extended with utilization
tracking bits for this purpose as shown in Figure 4-5.
Figure 4-5: Each L1 cache tag is extended to include additional bits for tracking (a) private utilization, and (b) last-access time of the cache line.
On the other hand, if the core is marked as a remote sharer, the integrated directory either increments a core-specific remote utilization counter or resets it to 1
based on the outcome of a Timestamp check (that is described below). If the remote
utilization counter has reached PCT, the requesting core is promoted, i.e., marked
as a private sharer and a copy of the cache line is handed over to it (as shown in Figure 4-4). Otherwise, the L2 cache replies with the requested word (② and ③ in Figure 4-3).

Figure 4-6: ACKwisep - Complete classifier directory entry. The directory entry contains the state, tag and ACKwisep pointers as well as (a) mode (P/R), (b) remote utilization counters and (c) last-access timestamps for tracking the locality of all the cores in the system.
The Timestamp check that must be satisfied to increment the remote utilization
counter is as follows. The last-access time of the cache line in the L2 cache
is greater than the minimum of the last-access times of all valid cache
lines in the same set of the requesting core's Li cache.
Note that if at
least one cache line is invalid in the Li cache, the above condition is trivially true.
Each directory entry is augmented with a per-core remote utilization counter and a
last-access timestamp (64-bits wide) for this purpose as shown in Figure 4-6. Each
LI cache tag also contains a last-access timestamp (shown in Figure 4-5) and this
information is used to calculate the above minimum last-access time in the LI cache.
This minimum is then communicated to the L2 cache on an Li miss.
The above Timestamp check is added so that when a cache line is brought into the
Li cache, other cache lines that are equally or better utilized are not evicted, i.e., the
cache line does not cause LI cache pollution. For example, consider a benchmark that
is looping through a data structure with low locality. Applying the above Timestamp
check allows the system to keep a subset of the working set in the LI cache. Without
the Timestamp check, a remote sharer would be promoted to be a private sharer after
a few accesses (even if the other lines in the LI cache are well utilized). This would
result in cache lines evicting each other and ping-ponging between the LI and L2
caches.
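A condensed sketch of the directory-side handling of a read miss under this protocol is given below. The data structures, the PCT constant and the way the L1 communicates its minimum last-access time are simplified assumptions made for the sketch; only the decision logic mirrors the description above.

```python
from dataclasses import dataclass, field

PCT = 4  # private caching threshold (the value selected in Section 4.5)

@dataclass
class CoreLocality:
    mode: str = "private"      # every core starts out as a private sharer
    remote_util: int = 0       # remote utilization counter kept at the directory

@dataclass
class DirEntry:
    last_access: int = 0                             # last-access time of the line at the L2
    locality: dict = field(default_factory=dict)     # core id -> CoreLocality

def handle_read_miss(entry, core, l1_min_last_access, now):
    """Return 'line' (private copy handed out) or 'word' (remote access at the L2)."""
    info = entry.locality.setdefault(core, CoreLocality())
    if info.mode == "private":
        entry.last_access = now
        return "line"
    # Remote sharer: increment the counter only if the Timestamp check passes.
    if entry.last_access > l1_min_last_access:
        info.remote_util += 1
    else:
        info.remote_util = 1      # bringing the line in would have polluted the L1 set
    entry.last_access = now
    if info.remote_util >= PCT:   # promotion to private sharer
        info.mode = "private"
        return "line"
    return "word"
```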
4.2.2 Write Requests
When a core makes a write request that misses in its private Li cache, the request
is sent to the L2 cache. The directory performs the following actions if the core is
marked as a private sharer: (1) it invalidates all the private sharers of the cache line,
(2) it sets the remote utilization counters of all its remote sharers to '0', and (3) it
hands out a private read-write copy of the line to the requesting core. The core then
tracks the locality of the cache line by initializing the private utilization counter in its
Li cache to 1 and incrementing this counter on every subsequent read/write request.
On the other hand, if the core is marked as a remote sharer, the directory performs
the following actions: (1) it invalidates all the private sharers, (2) it sets the remote
utilization counters of all remote sharers other than the requesting core to '0', and
(3) it increments the remote utilization counter for the requesting core, or resets it
to 1 using the same Timestamp check as described earlier for read requests. If the
utilization counter has reached PCT, the requesting core is promoted and a private
read-write copy of the cache line is handed over to it. Otherwise, the word to be
written is stored in the L2 cache.
When a core writes to a cache line, the utilization counters of all remote sharers (other than the writer itself) must be set to '0' since they have been unable to
demonstrate enough utilization to be promoted. All remote sharers must now build
up utilization again to be promoted.
4.2.3 Evictions and Invalidations
When the cache line is removed from the private Li cache due to eviction (conflict
or capacity miss) or invalidation (exclusive request by another core), the private
utilization counter is communicated to the directory along with the acknowledgement
message. The directory uses this information along with the remote utilization counter
present locally to classify the core as a private or remote sharer in order to handle
future requests.
If the (private + remote) utilization is > PCT, the core stays as a private sharer,
else it is demoted to a remote sharer (as shown in Figure 4-4). The remote utilization
is added because if the cache line had been brought into the private Li cache at the
time its remote utilization was reset to 1, it would not have been evicted (due to the
Timestamp check and the LRU replacement policy of the Li cache) or invalidated
any earlier. Therefore, the actual utilization observed during this classification phase
includes both the private and remote utilization.
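The reclassification step itself is compact; the helper below, written against the illustrative CoreLocality record used in the earlier sketch, applies the (private + remote) utilization rule described above.

```python
def reclassify_on_removal(info, private_util, pct=4):
    """Classify a core when its private L1 copy is evicted or invalidated.

    `private_util` arrives with the acknowledgement message; `info.remote_util`
    is the remote utilization kept at the directory (illustrative sketch only).
    """
    if private_util + info.remote_util >= pct:
        info.mode = "private"   # stays a private sharer
    else:
        info.mode = "remote"    # demoted: future misses are serviced as word accesses
```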
Performing classification using the mechanisms described above is expensive due
to the area overhead, both at the Li cache and the directory.
Each Li cache tag
needs to store the private utilization and last-access timestamp (a 64-bit field). And
each directory entry needs to track locality information (i.e., mode, remote utilization
and last-access timestamp) for all the cores. We now describe heuristics to facilitate
a cost-effective hardware implementation.
In Section 4.3, we remove the need for tracking the last-access time from
the Li cache and the directory. The basic idea is to approximate the outcome
of the Timestamp check by using a threshold higher than PCT for switching from
remote to private mode.
This threshold, termed Remote Access Threshold (RAT)
is dynamically learned by observing the Li cache set pressure and switches between
multiple levels so as to optimize energy and performance.
In Section 4.4, we describe a mechanism for predicting the mode (private/ remote)
of a core by tracking locality information for only a limited number of cores at the
directory.
4.3 Predicting Remote-Private Transitions
The Timestamp check described earlier served to preserve well utilized lines in the Li
cache. We approximate this mechanism by making the following two changes to the
protocol: (1) de-coupling the threshold for remote-to-private mode transition from
that for private-to-remote transition, (2) dynamically adjusting this threshold based
on the observed Li cache set pressure.
The threshold for remote-to-private mode transition, i.e., the number of accesses
at which a core transitions from a remote to private sharer, is termed Remote Access
Threshold (RAT). Initially, RAT is set equal to PCT (the threshold for private-toremote mode transition).
On an invalidation, if the core is classified as a remote
sharer, RAT is unchanged. This is because the cache set has an invalid line immediately following an invalidation leading to low cache set pressure.
Hence, we can
assume that the Timestamp check trivially passes on every remote access.
However, on an eviction, if the core is demoted to a remote sharer, RAT is increased to a higher level. This is because an eviction signifies higher cache set pressure.
By increasing RAT to a higher level, it becomes harder for the core to be promoted
to a private sharer, thereby counteracting the cache set pressure. If there are back-to-back evictions, with the core demoted to a remote sharer on each of them, RAT
is further increased to higher levels. However, RAT is not increased beyond a certain
value (RATmax) due to the following two reasons: (1) the core should be able to return
to the status of a private sharer if it later shows good locality, and (2) the number
of bits needed to track remote utilization should not be too high. Also, beyond a
particular RAT, keeping the core as a remote sharer counteracts the increased cache
pressure negligibly, leading to only small improvements in performance and energy.
The protocol is also equipped with a short-cut in case an invalid cache line exists
in the Li cache. In this case, if remote utilization reaches or rises above PCT, the
requesting core is promoted to a private sharer since it will not cause cache pollution. The number of RAT levels used is abbreviated as nRATlevels. RAT is additively increased in equal steps from PCT to RATmax, the number of steps being equal to (nRATlevels - 1).
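The RAT adjustment rules can be summarized in the following sketch, which also includes the reset to PCT on a private classification described in the next paragraph. The parameter values match those evaluated later (PCT = 4, RATmax = 16, nRATlevels = 2); the class is an illustrative stand-in that keeps only the state needed for the heuristic.

```python
PCT, RAT_MAX, N_RAT_LEVELS = 4, 16, 2
STEP = (RAT_MAX - PCT) // (N_RAT_LEVELS - 1)
RAT_LEVELS = [PCT + i * STEP for i in range(N_RAT_LEVELS)]   # equally spaced: [4, 16]

class RatState:
    """Per-(cache line, core) RAT level kept at the directory (illustrative)."""

    def __init__(self):
        self.level = 0            # start at RAT = PCT

    def rat(self):
        return RAT_LEVELS[self.level]

    def on_removal(self, demoted_to_remote, was_eviction):
        if not demoted_to_remote:
            self.level = 0        # classified as private: re-learn starting from RAT = PCT
        elif was_eviction:
            # An eviction signals L1 set pressure: make promotion harder, up to RAT_MAX.
            self.level = min(self.level + 1, N_RAT_LEVELS - 1)
        # A demotion on an invalidation leaves RAT unchanged (set pressure is low).
```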
On the other hand, if the core is classified as a private sharer on an eviction or
invalidation, RAT is reset to its starting value of PCT. Doing this is essential because
it provides the core the opportunity to re-learn its classification. Varying the RAT in
this manner removes the need to track the last-access time both in the Li cache tag
and the directory. However, a field that identifies the current RAT-level now needs to
be added to each directory entry. These bits now replace the last-access timestamp field in Figure 4-6. The efficacy of this scheme is evaluated in Section 4.11.3. Based on our observations, RATmax = 16 and nRATlevels = 2 were found to produce results that closely match those produced by the Timestamp-based classification scheme.

Figure 4-7: The limited locality classifier extends the directory entry with mode, utilization, and RAT-level bits for a limited number of cores. A majority vote of the modes of tracked cores is used to classify new cores as private or remote sharers.
4.4 Limited Locality Classifier
The classifier described earlier which keeps track of locality information for all cores
in the directory entry is termed the Complete locality classifier.
It has a storage
overhead of 60% (calculated in Section 6.2.3) at 64 cores and over 10x at 1024 cores.
In order to mitigate this overhead, we develop a classifier that maintains locality
information for a limited number of cores and classifies the other cores as private or
remote sharers based on this information.
The locality information for each core consists of (1) the core ID, (2) a mode
bit (P/R), (3) a remote utilization counter, and (4) a RAT-level. The classifier that
maintains a list of this information for a limited number of cores (k) is termed the
Limitedk classifier. Figure 4-7 shows the information that is tracked by this classifier.
The sharer list of ACKwise is not reused for tracking locality information because
of its different functionality. While the hardware pointers of ACKwise are used to
maintain coherence, the limited locality list serves to classify cores as private or
remote sharers. Decoupling in this manner also enables the locality-aware protocol to
be implemented efficiently on top of other scalable directory organizations. However,
the locality-aware protocol enables ACKwise to be implemented more efficiently as
will be described in Section 4.8. We now describe the working of the limited locality
classifier.
At startup, all entries in the limited locality list are free and this is denoted by marking all core IDs as INVALID. When a core makes a request to the L2 cache, the
directory first checks if the core is already being tracked by the limited locality list.
If so, the actions described in Section 4.2 are carried out. Else, the directory checks
if a free entry exists. If it does exist, it allocates the entry to the core and the actions
described in Section 4.2 are carried out.
Otherwise, the directory checks if a currently tracked core can be replaced. An
ideal candidate for replacement is a core that is currently not using the cache line.
Such a core is termed an inactive sharer and should ideally relinquish its entry to a
core in need of it. A private sharer becomes inactive on an invalidation or an eviction.
A remote sharer becomes inactive on a write by another core.
If a replacement
candidate exists, its entry is allocated to the requesting core. The initial mode of the
core is obtained by taking a majority vote of the modes of the tracked cores. This is
done so as to start off the requester in its most probable mode.
Finally, if no replacement candidate exists, the mode for the requesting core is
obtained by taking a majority vote of the modes of all the tracked cores. The limited
locality list is left unchanged.
The storage overhead for the Limitedk classifier is directly proportional to the number of cores (k) for which locality information is tracked. In Section 4.11.4, we will evaluate the accuracy of the Limitedk classifier. Based on our observations, the Limited3 classifier produces results that closely match and sometimes exceed those produced by the Complete classifier.
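The allocation and majority-vote policy of the Limitedk classifier can be sketched as follows; the record type and the is_inactive predicate are illustrative stand-ins for the directory structures and the inactive-sharer test described above, and the tie-breaking choice is an assumption of the sketch.

```python
from dataclasses import dataclass

@dataclass
class Rec:
    mode: str = "private"     # protocol default: cores start out as private sharers

def majority_mode(tracked):
    private_votes = sum(r.mode == "private" for r in tracked.values())
    return "private" if 2 * private_votes >= len(tracked) else "remote"

def classify_core(tracked, core, is_inactive, k=3):
    """Return the mode used for `core` under an illustrative Limited_k classifier."""
    if core in tracked:
        return tracked[core].mode
    if len(tracked) < k:                      # free entry: track the core in the default mode
        tracked[core] = Rec()
        return tracked[core].mode
    mode = majority_mode(tracked)             # infer the mode from the tracked cores
    inactive = [c for c in tracked if is_inactive(c)]
    if inactive:                              # replace an inactive sharer with the new core
        del tracked[inactive[0]]
        tracked[core] = Rec(mode)
    return mode                               # otherwise: vote only, list left unchanged
```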
4.5 Selection of PCT
The Private Caching Threshold (PCT) is a parameter to our protocol and combining
it with the observed spatio-temporal locality of cache lines, our protocol classifies data
as private or remote. The extent to which our protocol improves the performance and
energy consumption of the system is a complex function of the application characteristics, the most important being its working set and data sharing and access patterns.
In Section 4.11, we will describe how these factors influence performance and energy
consumption as PCT is varied for the evaluated benchmarks. We will also show that
choosing a static PCT of 4 for the simulated benchmarks meets our performance and
energy consumption improvement goals.
4.6 Overheads of the Locality-Based Protocol

4.6.1 Storage
The locality-aware protocol requires extra bits at the directory and private caches to
track locality information. At the private Li cache, tracking locality requires 2 bits
for the private utilization counter per cache line (assuming an optimal PCT of 4).
At the directory, the Limited3 classifier tracks locality information for three sharers.
Tracking one sharer requires 4 bits to store the remote utilization counter (assuming
an RATmax of 16), 1 bit to store the mode, 1 bit to store the RAT-level (assuming 2 RAT levels) and 6 bits to store the core ID (for a 64-core processor). Hence, the Limited3 classifier requires an additional 36 (= 3 x 12) bits of information per directory entry. The Complete classifier, on the other hand, requires 384 (= 64 x 6)
bits of information. The assumptions stated here will be justified in the evaluation
section.
The following calculations are for one core but they are applicable for the entire
system since all cores are identical. The sizes for the per-core Li-I, Li-D and L2 caches
used in our system are shown in Table 4.1. The directory is integrated with the L2
cache, so each L2 cache line has an associated directory entry. The storage overhead
in the L1-I and L1-D caches is (2/512) x (16 + 32) = 0.19 KB. We neglect this in future calculations since it is really small. The storage overhead in the directory for the Limited3 classifier is (36/512) x 256 = 18 KB. For the Complete classifier, it is 192 KB. Now, the storage required for the ACKwise4 protocol in this processor is 12 KB (assuming 24 bits per directory entry) and that for the Full-Map protocol is 32 KB. Adding up all the storage components, the Limited3 classifier with the ACKwise4 protocol uses less storage than the Full-map protocol and 5.7% more storage than the baseline ACKwise4 protocol (factoring in the L1-I, L1-D and L2 cache sizes also). The Complete classifier with ACKwise4 uses 60% more storage than the baseline ACKwise4 protocol. Table 4.2 shows the storage overheads of the protocols of interest.

L1-I Cache   L1-D Cache   L2 Cache   Total    Line Size
16 KB        32 KB        256 KB     304 KB   64 bytes

Table 4.1: Cache sizes per core.

               Full-Map   ACKwise4   ACKwise4 Complete   ACKwise4 Limited3
Directory      32 KB      12 KB      236 KB              32 KB
Overall        336 KB     316 KB     540 KB              336 KB

Table 4.2: Storage required for the caches and coherence protocol per core.
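The storage numbers above follow from a straightforward bits-per-entry calculation. The short script below reproduces the per-core figures under the stated assumptions (a 256 KB L2 slice with 64-byte lines, 24-bit ACKwise4 and 64-bit full-map entries, and 36 or 384 classifier bits per directory entry); it is a worked check of the arithmetic, not part of any tool.

```python
L2_KB, LINE_BYTES = 256, 64
entries = L2_KB * 1024 // LINE_BYTES              # one directory entry per L2 line = 4096

def directory_kb(bits_per_entry):
    return entries * bits_per_entry / 8 / 1024    # bits -> KB

print("ACKwise4 directory :", directory_kb(24), "KB")     # -> 12 KB
print("Full-Map directory :", directory_kb(64), "KB")     # -> 32 KB
print("Limited3 classifier:", directory_kb(36), "KB")     # -> 18 KB
print("Complete classifier:", directory_kb(384), "KB")    # -> 192 KB
print("L1 utilization bits:", 2 / 512 * (16 + 32), "KB")  # 2 bits per 512-bit line -> ~0.19 KB
```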
4.6.2 Cache & Directory Accesses
Updating the private utilization counter in a cache requires a read-modify-write operation on every cache hit. This is true even if the cache access is a read. However,
the utilization counter, being just 2 bits in length, can be stored in the tag array.
Since the tag array already needs to be written on every cache hit to update the
replacement policy (e.g., LRU) counters, our protocol does not incur any additional
cache accesses.
On the directory side, lookup/update of the locality information can be performed
during the same operation as the lookup/update of the sharer list for a particular
cache line. However, the lookup/update of the directory entry is now more expensive
since it includes both the sharer list and the locality information.
This additional
expense has been accounted for in the evaluation.
4.6.3 Network Traffic
The locality-aware protocol could create network traffic overhead due to the following
three reasons:
1. The private utilization counter has to be sent along with the acknowledgement
to the directory on an invalidation or an eviction.
2. In addition to the cache line address, the cache line offset and the memory
access length have to be communicated during every cache miss. This is because
the requester does not know whether it is a private or remote sharer (only the
directory maintains this information as explained previously).
3. The data word(s) to be written has (have) to be communicated on every cache
miss due to the same reason.
Some of these overheads can be hidden while others are accounted for during
evaluation as described below.
1. Sending back the utilization counter can be accomplished without creating additional network flits. For a 48-bit physical address (standard for most 64-bit
x86 processors) and 64-bit flit size, an invalidation message requires 42 bits for
the physical cache line address, 12 bits for the sender and receiver core IDs and
2 bits for the utilization counter. The remaining 8 bits suffice for storing the
message type.
2. The cache line offset needs to be communicated but not the memory access
length. We profiled the memory access lengths for the benchmarks evaluated
and found it to be 64 bits in the common case. (Note that 64 bits is the processor
word size.) Memory accesses that are < 64 bits in length are rounded-up to 64
bits while those > 64 bits always fetch an entire cache line. Only 1 bit is needed
to indicate this difference. Hence, a request to the directory on a cache miss
uses 48 bits for the physical address (including offset), and 12 bits for the sender
and receiver core IDs. The remaining 4 bits suffice for storing the message type.
3. The data word to be written (64 bits in length) is always communicated to the
directory on a write miss in the Li cache. This overhead is accounted for in our
evaluation.
4.6.4 Transitioning between Private/Remote Modes
In our protocol no additional coherence protocol related messages are invoked when
a sharer transitions between private and remote modes.
4.7 Simpler One-Way Transition Protocol
The complexity of the above protocol could be decreased if cores, once they were classified as remote sharers w.r.t. a cache line, would stay in the same mode throughout the program. If this were true, the storage required to track locality information at the directory could be avoided except for the mode bits. The bits to track private utilization at the cache tags would still be required to demote a core to the status of a remote sharer. We term this simpler protocol Adapt1-way, and in Section 4.11.5, we
observe that this protocol is worse than the original protocol by 34% in completion
time and 13% in energy. Hence, a protocol that incorporates dynamic transitioning
between both modes is required for efficient operation.
4.8 Synergy with ACKwise
The locality-aware coherence protocol reduces the number of private sharers of low-locality cache lines and increases the number of remote sharers.
This is beneficial
for the ACKwise protocol since it reduces the number of invalidations as well as the
number of overflows of the limited directory. Section 4.11.6 evaluates this synergy.
4.9 Potential Advantages of Locality-Aware Cache Coherence
The locality-aware coherence protocol has the following key advantages over the conventional private-caching protocols.
1. By allocating cache lines only for high locality sharers, the protocol prevents
the pollution of caches with low locality data and makes better use of their
capacity.
2. It reduces overall system energy consumption by reducing the amount of network traffic and cache accesses.
The network traffic is reduced by removing
invalidation, flush and eviction traffic for low locality data as well as by returning/storing only a word instead of an entire cache line.
3. Removing invalidation and flush messages and returning a word instead of a
cache line also improves the average memory latency.
4.10 Evaluation Methodology
We evaluate a 64-core multicore using in-order cores. The default architectural parameters used for evaluation are shown in Table 2.1. The parameters specific to the
locality-aware cache coherence protocol are shown in Table 4.3.
Architectural Parameter            Value
Private Caching Threshold          PCT = 4
Max Remote Access Threshold        RATmax = 16
Number of RAT Levels               nRATlevels = 2
Classifier                         Limited3

Table 4.3: Locality-aware cache coherence protocol parameters
4.10.1 Evaluation Metrics
Each multithreaded benchmark is run to completion using the input sets from Table 2.3.
We measure the energy consumption of the memory system including the
on-chip caches, and the network. For each simulation run, we measure the Completion
Time, i.e., the time spent in the parallel region of the benchmark; this includes the compute
latency, the memory access latency, and the synchronization latency.
The memory
access latency is further broken down into four components.
1. Li to L2 cache latency is the time spent by the LI cache miss request to the
L2 cache and the corresponding reply from the L2 cache including time spent
in the network and the first access to the L2 cache.
2. L2 cache waiting time is the queueing delay incurred because requests to the
same cache line must be serialized to ensure memory consistency.
3. L2 cache to sharers latency is the round-trip time needed to invalidate
private sharers and receive their acknowledgments.
This also includes time
spent requesting and receiving synchronous write-backs.
4. L2 cache to off-chip memory latency is the time spent accessing memory
including the time spent communicating with the memory controller and the
queueing delay incurred due to finite off-chip bandwidth.
We also measure the energy consumption of the memory system which includes
the on-chip caches and the network. Our goal with the ACKwise and the Locality-aware Private Cache Replication protocols is to minimize both Completion Time and
Energy Consumption.
One of the important memory system metrics we track to evaluate our protocol
is the various cache miss types. They are as follows:
1. Cold misses are cache misses that occur to a line that has never been previously
brought into the cache.
2. Capacity misses are cache misses to a line that was brought in previously but
later evicted to make room for another line.
3. Upgrade misses are cache misses to a line in read-only state when an exclusive
request is made for it.
4. Sharing misses are cache misses to a line that was brought in previously but
was invalidated or downgraded due to a read/write request by another core.
5. Word misses are cache misses to a line that was remotely accessed previously.
4.11 Results
The architectural parameters from Table 2.1 are used for the study unless otherwise
stated.
In Section 4.11.1, a sweep study is performed to understand the trends in
Energy and Completion Time for the evaluated benchmarks as PCT (Private Caching
Threshold) is varied. In Section 4.11.3, the approximation scheme for the Timestamp-based classification is evaluated and the optimal number of remote access threshold levels (nRATlevels) and maximum RAT threshold (RATmax) is determined. Next, in Section 4.11.4, the accuracy of the limited locality tracking classifier (Limitedk) is evaluated by performing a sensitivity analysis on k. Section 4.11.5 compares the Energy and Completion Time of the locality-aware protocol against the simpler one-way transition protocol (Adapt1-way). The best value of PCT obtained in Section 4.11.1
is used for the experiments in Sections 4.11.3, 4.11.4, and 4.11.5.
Section 4.11.6 evaluates the synergy between the ACKwise and the locality-aware
coherence protocol.
4.11.1 Energy and Completion Time Trends
Figures 4-8 and 4-9 plot the energy and completion time of the evaluated benchmarks
as a function of PCT. Results are normalized in both cases to a PCT of 1 which
corresponds to the baseline R-NUCA system with the ACKwise 4 directory protocol.
Figure 4-8: Variation of Energy with PCT. Results are normalized to a PCT of 1. Note that Average and not Geometric-Mean is plotted here. Energy is broken down into L1-I cache, L1-D cache, L2 cache, directory, network router and network link components.
Figure 4-9: Variation of Completion Time with PCT. Results are normalized to a
PCT of 1. Note that Average and not Geometric-Mean is plotted here.
Both energy and completion time decrease initially as PCT is increased. As PCT increases to higher values, both completion time and energy start to increase.
Energy
We consider the impact that our protocol has on the energy consumption of the memory system (L1-I cache, L1-D cache, L2 cache and directory) and the interconnection network (both router and link). The distribution of energy between the caches and network varies across benchmarks and is primarily dependent on the L1-D cache miss rate. For this purpose, the L1-D cache miss rate is plotted along with miss type breakdowns in Figure 4-10.

Figure 4-10: L1 Data Cache Miss Rate and Miss Type Breakdown vs PCT. Note that in this graph, the miss rate increases from left to right as well as from top to bottom.
Benchmarks such as WATER-SP and SUSAN with low cache miss rates (~0.2%) dissipate 95% of their energy in the L1 caches while those such as CONCOMP and LU-NC with higher cache miss rates dissipate more than half of their energy in the network.
The energy consumption of the L2 cache compared to the Li-I and L1-D caches is
also highly dependent on the L1-D cache miss rate.
For example, WATER-SP has
negligible L2 cache energy consumption, while OCEAN-NC's L2 energy consumption
is more than its combined Li-I and L1-D energy consumption.
At the 11nm technology node, network links have a higher contribution to the
energy consumption than network routers. This can be attributed to the poor scaling
trends of wires compared to transistors. As shown in Figure 4-8, this trend is observed
in all our evaluated benchmarks.
The energy consumption of the directory is negligible compared to all other sources
of energy consumption. This motivated our decision to put the directory in the L2
cache tag arrays as described earlier. The additional bits required to track locality
information at the directory have a negligible effect on energy consumption.
Varying PCT impacts energy by changing both network traffic and cache accesses.
In particular, increasing the value of PCT decreases the number of private sharers of
a cache line and increases the number of remote sharers. This impacts the network
traffic and cache accesses in the following three ways.
(1) Fetching an entire line
on a cache miss in conventional coherence protocols is replaced by multiple word
accesses to the shared L2 cache. Note that each word access at the shared L2 cache
requires a lookup and an update to the utilization counter in the directory as well. (2)
Reducing the number of private sharers decreases the number of invalidations (and
acknowledgments) required to keep all cached copies of a line coherent. Synchronous
write-back requests that are needed to fetch the most recent copy of a line are reduced
as well. (3) Since the caching of low-locality data is eliminated, the Li cache space
can be more effectively used for high locality data, thereby decreasing the amount of
asynchronous evictions (leading to capacity misses) for such data.
Benchmarks that yield a significant improvement in energy consumption do so by
converting either capacity misses (in BODYTRACK and BLACKSCHOLES) or sharing
misses (in DIJKSTRA-SS and STREAMCLUSTER) into cheaper word misses. This can
be observed from Figure 4-8 when going from a PCT of 1 to 2 in BODYTRACK and
BLACKSCHOLES and a PCT of 2 to 3 in DIJKSTRA-SS and STREAMCLUSTER.
While a
sharing miss is more expensive than a capacity miss due to the additional network traffic generated by invalidations and synchronous write-backs, turning capacity misses
into word misses improves cache utilization (and reduces cache pollution) by reducing
evictions and thereby capacity misses for other cache lines. This is evident in benchmarks
like BLACKSCHOLES, BODYTRACK, DIJKSTRA-AP and MATMUL in which the
cache miss rate drops when switching from a PCT of 1 to 2. Benchmarks like LU-NC,
and PATRICIA provide energy benefit by converting both capacity and sharing misses
into word misses.
At a PCT of 4, the geometric mean of the energy consumption across all benchmarks is less than that at a PCT of 1 by 25%.
Completion Time
As shown in Figure 4-9, our protocol reduces the completion time as well. Noticeable
improvements (>5%) occur in 11 out of the 21 evaluated benchmarks. Most of the improvements occur for the same reasons as discussed for energy, and can be attributed
to our protocol identifying low locality cache lines and converting the capacity and
sharing misses on them to cheaper word misses. Benchmarks such as BLACKSCHOLES,
DIJKSTRA-AP
and MATMUL experience a lower miss rate when PCT is increased from
1 to 2 due to better cache utilization. This translates into a lower completion time.
In
CONCOMP,
cache utilization does not improve but capacity misses are converted
into almost an equal number of word misses. Hence, the completion time improves.
Benchmarks such as STREAMCLUSTER and TSP show completion time improvement due to converting expensive sharing misses into word misses. From a performance standpoint, sharing misses are expensive because they increase:
(1) the L2
cache to sharers latency and (2) the L2 cache waiting time. Note that the L2 cache
waiting time of one core may depend on the L2 cache to sharers latency of another
since requests to the same cache line need to be serialized.
In these benchmarks,
even if cache miss rate increases with PCT, the miss penalty is lower because a word
miss is much cheaper than a sharing miss. A word miss does not contribute to the
L2 cache to sharers latency and only contributes marginally to the L2 cache waiting
time. Hence, the above two memory access latency components can be significantly
reduced.
Reducing these components may decrease synchronization time as well if
the responsible memory accesses lie within the critical section.
STREAMCLUSTER
and DIJKSTRA-SS mostly reduce the L2 cache waiting time while PATRICIA and TSP
reduce the L2 cache to sharers latency.
In a few benchmarks such as LU-NC and BARNES,
completion time is found to
increase after a PCT of 3 because the added number of word misses overwhelms any
improvement obtained by reducing capacity misses.
At a PCT of 4, the geometric mean of the completion time across all benchmarks
is less than that at a PCT of 1 by 15%.
Figure 4-11: Variation of Geometric-Means of Completion Time and Energy with
Private Caching Threshold (PCT). Results are normalized to a PCT of 1.
4.11.2 Static Selection of PCT
To put everything in perspective, we plot the geometric means of the Completion
Time and Energy for our benchmarks in Figure 4-11. We observe a gradual decrease
of completion time up to PCT of 3, constant completion time till a PCT of 4, and an
increase in completion time afterward. Energy consumption decreases up to PCT of
5, then stays constant till a PCT of 8 and after that, it starts increasing. We conclude
that a PCT of 4 meets our goal of simultaneously improving both completion time and
energy consumption. A completion time reduction of 15% and an energy consumption
improvement of 25% is obtained when moving from PCT of 1 to 4.
4.11.3 Tuning Remote Access Thresholds

Figure 4-12: Remote Access Threshold sensitivity study for nRATlevels (L) and RATmax (T). The configurations evaluated are Timestamp, L-1, L-2 T-8, L-2 T-16, L-4 T-8, L-4 T-16 and L-8 T-16.
As explained in Section 4.3, the Timestamp-based classification scheme was expensive due to its area overhead, both at the Li cache and directory. The benefits provided by this scheme can be approximated by having multiple Remote Access Threshold (RAT) Levels and dynamically switching between them at runtime to counteract
the increased Li cache pressure. We now perform a study to determine the optimal
number of threshold levels (nRATlevels) and the maximum threshold (RATmax).
Figure 4-12 plots the completion time and energy consumption for the different
points of interest. The results are normalized to that of the Timestamp-based classification scheme. The completion time is almost constant throughout. However, the
energy consumption is nearly 9% higher when nRATlevels = 1. With multiple RAT levels (nRATlevels > 1), the energy is significantly reduced. Also, the energy consumption with RATmax = 16 is found to be slightly lower (2%) than with RATmax = 8. With RATmax = 16, there is almost no difference between nRATlevels = 2, 4, 8,
so we choose nRATlevels = 2 since it minimizes the area overhead.

4.11.4 Limited Locality Tracking
As explained in Section 4.4, tracking the locality information for all the cores in the
directory results in an area overhead of 60% per core. So, we explore a mechanism
that tracks the locality information for only a few cores, and classifies a new core as a
private or remote sharer based on a majority vote of the modes of the tracked cores.
Figure 4-13 plots the completion time and energy of the benchmarks with the
Limitedk classifier when k is varied as (1, 3, 5, 7, 64).
k = 64 corresponds to the
Complete classifier. The results are normalized to that of the Complete classifier.
The benchmarks that are not shown are identical to
WATER-SP,
i.e., the completion
time and energy stay constant as k varies. The experiments are run with the best
static PCT value of 4 obtained in Section 4.11.2. We observe that the completion
time and energy consumption of the Limited 3 classifier never exceeds by more than
3% the completion time and energy consumption of the Complete classifier.
In STREAMCLUSTER and DIJKSTRA-SS, the Limited 3 classifier does better than
the Complete classifier because it learns the mode of sharers quicker.
Figure 4-13: Variation of Completion Time and Energy with the number of hardware locality counters (k) in the Limitedk classifier. Limited64 is identical to the Complete classifier. Benchmarks for which results are not shown are identical to WATER-SP, i.e., the Completion Time and Energy stay constant as k varies.

While the
Complete classifier starts off each sharer of a cache line independently in private
mode, the Limited 3 classifier infers the mode of a new sharer from the modes of
existing sharers. This enables the Limited 3 classifier to put the new sharer in remote
mode without the initial per-sharer classification phase. We note that the Complete
locality classifier can also be equipped with such a learning short-cut.
Inferring the modes of new sharers from the modes of existing sharers can also be harmful, as illustrated by RADIX and BODYTRACK for the Limited-1 classifier. The cache miss rate breakdowns for these benchmarks as the number of locality counters is varied are shown in Figure 4-14. While RADIX starts off new sharers incorrectly in remote mode, BODYTRACK starts them off incorrectly in private mode. This is because the first sharer is classified as remote (in RADIX) and private (in BODYTRACK).

Figure 4-14: Cache miss rate breakdown (Cold, Capacity, Upgrade, Sharing and Word misses) variation with the number of hardware locality counters (k) in the Limited-k classifier for (a) RADIX and (b) BODYTRACK. Limited-64 is identical to the Complete classifier.

This causes other sharers to also reside in that mode while they
actually want to be in the opposite mode. Our observation from the above sensitivity
experiment is that tracking the locality information for three sharers suffices to offset
such incorrect classifications.
4.11.5
Simpler One-Way Transition Protocol
Figure 4-15: Ratio of Completion Time and Energy of Adapt1-way over Adapt2-way.
In order to quantify the efficacy of the dynamic nature of our protocol, we compare the protocol to a simpler version having only one-way transitions (Adapt1-way). The simpler version starts off all cores as private sharers and demotes them to remote sharers when the utilization is less than the private caching threshold (PCT). However, these cores then stay as remote sharers throughout the lifetime of the program and can never be promoted. The experiment is run with the best PCT value of 4.
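The behavioral difference between the two variants can be summarized with a short illustrative sketch (assuming a per-sharer reuse count that is compared against PCT at each classification event; the names are chosen for exposition and the reuse-measurement details are simplified).

#include <cstdint>

enum class Mode : uint8_t { Private, Remote };

// Adapt-2way (our protocol): demote low-reuse sharers to remote mode and
// promote them back to private mode when their measured reuse reaches PCT.
Mode reclassifyTwoWay(uint32_t utilization, uint32_t pct) {
  return (utilization >= pct) ? Mode::Private : Mode::Remote;
}

// Adapt-1way: a sharer starts private, can be demoted once, and then stays
// remote for the rest of the program; there is no promotion path.
Mode reclassifyOneWay(Mode current, uint32_t utilization, uint32_t pct) {
  if (current == Mode::Remote) return Mode::Remote;   // demotion is permanent
  return (utilization >= pct) ? Mode::Private : Mode::Remote;
}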
Figure 4-15 plots the ratio of completion time and energy for the Adapt1-way protocol over our protocol (which we term Adapt2-way). The higher the ratio, the higher the need for two-way transitions. We observe that the Adapt1-way protocol is worse in completion time and energy by 34% and 13% respectively. In benchmarks such as BODYTRACK and DIJKSTRA-SS, the completion time ratio is worse by 3.3x and 2.3x respectively.
4.11.6
Synergy with ACKwise
Figure 4-16: Synergy between the locality-aware coherence protocol and ACKwise. Variation of average and maximum sharer count during invalidations as a function of PCT (for radix, lu-contig, lu-noncontig, ocean-contig and barnes).
Figure 4-16 evaluates the synergy between the locality-aware private cache replication scheme and ACKwise. It plots the average and maximum sharer count during
invalidations as a function of PCT. As PCT increases, both the above sharer counts
decrease since low-locality data is cached remotely. At a PCT of 4, the average sharer
count in most benchmarks is reduced from 2 - 2.5 to 0.5 - 1. This reduces the number
of invalidations and the number of overflows of the limited directory.
4.12
Summary
This chapter introduced a locality-aware private cache replication protocol to improve
on-chip memory access latency and energy efficiency in large-scale multicores. This
protocol is motivated by the observation that cache lines exhibit varying degrees of
reuse (i.e., variable spatio-temporal locality) at the private cache levels. A cache-line
level classifier is introduced to distinguish between low and high-reuse cache lines.
A traditional cache coherence scheme that replicates data in the private caches is
employed for high-reuse data. Low-reuse data is handled efficiently using a remote access [31] mechanism.
Remote access does not allocate data in the private cache levels. Instead, it allocates only a single copy in a particular core's shared cache slice and directs load/store requests made by all cores towards it. Data access is performed at the word level and requires a roundtrip message between the requesting core and the remote cache slice. This improves the utilization of private cache resources by removing unnecessary data replication. In addition, it reduces network traffic by transferring only those words in a cache line that are accessed on-demand. Consequently, unnecessary invalidations and write-back requests are eliminated, reducing network traffic even further.
The locality-aware private cache replication protocol preserves the familiar programming paradigm of shared memory while using remote accesses for data access
efficiency. The protocol has been evaluated for the Sequential Consistency (SC) memory model using in-order cores with a single outstanding memory transaction per core.
Evaluation on a 64-core multicore shows that the protocol reduces the overall Energy
Consumption by 25% while improving the Completion Time by 15%. The protocol
can be implemented with only 18 KB storage overhead per core when compared to
the ACKwise4 limited directory protocol, and has a lower storage overhead than a
full-map directory protocol.
Chapter 5
Timestamp-based Memory Ordering
Validation
State-of-the-art multicore processors have to balance ease of programming with good
performance and energy efficiency. The programming complexity is significantly affected by the memory consistency model of the processor. The memory model dictates the order in which the memory operations of one thread appear to another. The
strongest memory model is the Sequential Consistency (SC) [59] model. SC mandates
that the global memory order is an interleaving of the memory accesses of each thread
with each thread's memory accesses appearing in program order in this global order.
SC is the most intuitive model to the software developer and is the easiest to program
and debug with.
Production processors do not implement SC due to its negative performance impact. SPARC RMO [1], ARM [36] and IBM Power [65] processors implement relaxed (weaker) memory models that allow reordering of load & store instructions, with explicit fences for ordering when needed. These processors can better exploit memory-level parallelism (MLP), but require careful programmer-directed insertion of memory fences to do so. Automated fence insertion techniques sacrifice performance for programmability [7].
Intel x86 [83] and SPARC [1] processors implement Total Store Order (TSO), which attempts to strike a balance between programmability and performance. The TSO model only relaxes the Store→Load ordering of SC, and improves performance by enabling loads (which are crucial to performance) to bypass stores in the write buffer. Note that fences may still be needed in critical sections of code where the Store→Load ordering is required.
Implementing the TSO model on out-of-order multicore processors in a straightforward manner sacrifices memory-level parallelism. This is because loads have to wait for all previous load/fence operations to complete before being issued, while stores/fences have to wait for all previous load/store/fence operations. This inefficiency is circumvented in current processors by employing two optimizations [3]. (1) Load performance is improved using speculative out-of-order execution, enabling loads to be issued and completed before previous load/fence operations. Memory consistency violations are detected when invalidations, updates or evictions are made to addresses in the load queue. The pipeline state is rolled back if this situation arises. (2) Store performance is improved using exclusive store prefetch [33] requests. These prefetch requests fetch the cache line into the L1-D cache and can be executed out-of-order and in parallel. The store requests, on the other hand, must be issued and completed in-order to preserve TSO. However, most store requests hit in the L1-D cache (due to the earlier prefetch request) and hence can be completed quickly. The performance of fences is automatically improved by optimizing previous load & store operations. Note that the above two optimizations can also be employed to improve the performance of processors under sequential consistency or memory models weaker than TSO.
5.1
Principal Contributions
This chapter explores how to extend the locality-aware private cache replication scheme to out-of-order speculative processors for popular memory models. Unfortunately, the remote (cache) access required by the locality-aware scheme is incompatible with the two optimizations described earlier. Since private cache copies of cache lines are not always maintained, invalidation/update requests cannot be used to detect memory consistency violations for speculatively executed load operations. In addition, an exclusive store prefetch request is not applicable since a remote access never caches data in the private cache.
In this chapter, we present a novel technique that uses timestamps to detect memory consistency violations when speculatively executing loads under the locality-aware
protocol. Each load and store operation is assigned an associated timestamp and a
simple arithmetic check is done at commit time to ensure that memory consistency
has not been violated. This technique does not rely on invalidation/update requests
and hence is applicable to remote accesses. The timestamp mechanism is efficient due
to the observation that consistency violations occur due to conflicting accesses that
have temporal proximity (i.e., within a few cycles of each other), thus requiring timestamps to be stored only for a small time window. This technique works completely
in hardware and requires only 2.5 KB of storage per core. This scheme guarantees
forward progress and is starvation-free.
An alternate technique is also implemented that is based on the observation that memory consistency violations only occur due to conflicting accesses to shared read-write data [84]. Accesses to private data and concurrent reads to shared read-only data cannot cause violations. Hence, a mechanism is designed that classifies data into 3 categories: (1) private, (2) shared read-only, and (3) shared read-write. Total Store Order (TSO) is enforced through serialization of load/store/fence accesses for the 3rd category, while memory accesses from the first two categories can be executed out-of-order. The classification is done using a hardware-software mechanism that takes
advantage of existing TLB and page table structures. The implementation ensures
precise interrupts and smooth transition between the categories. This technique can
also be used to improve the energy efficiency of the timestamp-based technique as
will be discussed later.
Our evaluation using a 64-core multicore with out-of-order speculative cores shows
that the timestamp-based technique, when implemented on top of the locality-aware
cache coherence protocol [54], improves completion time by 16% and energy by 21%
over a state-of-the-art cache management scheme (Reactive NUCA [39]).
The rest of this chapter is organized as follows. Section 5.2 provides background
information about out-of-order core models. Section 5.3 provides background information about formal methods to specify memory models and will be used in proving
the correctness of the timestamp-based validation technique presented in this chapter.
Section 5.4 presents the timestamp-based technique to detect memory consistency violations. This section also introduces a few variations of the technique with different
hardware complexity and performance trade-offs. Section 5.5 discusses how to implement the timestamp-based technique on other memory models and a couple of design
issues. Section 5.6 presents an alternate hardware-software co-design technique that
exploits the observation that memory consistency is only violated by conflicting accesses to shared read-write data. Section 5.7 describes the experimental methodology.
Section 5.8 presents the results. And finally, Section 5.9 summarizes the chapter.
5.2
Background: Out-of-Order Processors
To facilitate describing the mechanisms, we briefly outline the backend of a modern
out-of-order processor.
The backend consists of the Physical Register File (PRF),
Reorder Buffer (ROB), Reservation Stations (RS) for each functional unit, Speculative
Load Queue (LQ) and Store Queue (SQ).
1. Physical Register File (PRF) contains the values of all registers and is co-located with the functional units.
2. Register Alias Table (RAT) contains the mapping from the architectural register file (ARF) to the physical register file (PRF).
3. Reorder Buffer (ROB) ensures that instructions (micro-ops) are committed in program order. Each micro-op is dispatched to and retired from the re-order buffer in program order. Program order commit is required to recover precise state in the event of interrupts, exceptions and mis-speculations.
4. Reservation Station (RS) implements the out-of-order logic. It holds micro-ops until they are ready to be issued to each functional unit. Each functional unit has a dedicated reservation station. Micro-ops are dispatched to the RS in program order but can be retired out-of-order as soon as their operands are ready.
5. Load-Store Unit handles the memory load and store operations in compliance with the memory consistency model of the processor.
6. Execution Unit performs the integer and floating point operations.
When the result of a functional unit (execution unit / load-store unit) is ready, the physical register ID of the result register is broadcast on the common data bus (CDB). Each reservation station compares this result register ID against the operand IDs of the micro-ops that are not yet ready. It marks the ready micro-ops so that
they can be issued to the relevant functional unit in the next clock cycle.
The result value of a functional unit is directly written into the physical register
file. Likewise, when a micro-op is issued to a functional unit, the operands are read
directly from the physical register file. Using a PRF in this manner reduces the data
movement within a core and removes the need to store register values in the ROB and RS. Only register IDs are stored within the ROB and RS. However, since writes are done to the PRF out of program order, special hardware support is needed to
maintain precise register state.
5.2.1
Load-Store Unit
The load-store unit performs the load and store operations in compliance with the
memory consistency model of the processor. It contains the following components.
1. Address Unit computes the virtual address (given the register read operands).
2. Memory Management Unit (MMU) translates the virtual address into physical
address using the Translation Lookaside Buffer (TLB) and page table hierarchy.
3. Store Queue holds the store addresses and values to be written to memory in program order.
4. Load Queue serves to enforce the memory consistency model of the processor
by detecting load speculation violations.
We first define a few terms to facilitate describing the techniques.
" DispatchTime: Time at which micro-ops are allocated an entry in the re-order
buffer and reservation station. All memory operations are allocated an entry in
the load or store queues. Load and store operations are dispatched in program
order.
* Issue Time: Time at which arithmetic/memory operations are issued from the
reservation station to the execution units/cache subsystem.
For a memory
operation to be issued, the operands have to be ready.
" CompletionTime: For a load operation, it is the time at which the load receives
the data from the cache subsystem. For a store operation, it is the time at
which the store receives an acknowledgement after its value has been written to
the cache subsystem and propagated to all the cores in the system.
" Commit Time: Time at which a memory operation is committed. Operations are
committed in program order for precise state during exceptions and interrupts.
A load is committed after it is complete and all previous operations have been
committed. A store is committed as soon as its address and data operands are
ready and virtual-to-physical address translation is complete.
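For reference, the four lifetime events defined above can be gathered into a single conceptual record. This is only a bookkeeping sketch used when describing the ordering rules later in the chapter; the hardware does not store all of these values explicitly.

#include <cstdint>

enum class OpType : uint8_t { Load, Store, Fence };

// Conceptual record of the per-operation times defined above (sketch only).
struct MemoryOpTimes {
  OpType   type;
  uint64_t dispatchTime;    // allocated a ROB / load-store queue entry
  uint64_t issueTime;       // sent to the cache subsystem (operands ready)
  uint64_t completionTime;  // data returned (load) or store acknowledged
  uint64_t commitTime;      // retired from the ROB in program order
};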
5.2.2
Lifetime of Load/Store Operations
The lifetime of a load operation is as follows. A load is first dispatched to the reorder
buffer/reservation station in program order. Once the operands are ready, the physical address is computed (using the address calculation and the virtual-to-physical
translation mechanisms), and the load is simultaneously issued to the load queue and
cache subsystem. The load completes when the data is returned, and the load is committed in program order after it completes.
The lifetime of a store operation is as follows.
A store is first dispatched to
the reorder buffer/reservation station in program order. Once the address & data
operands are ready, the physical address is computed, an exclusive store prefetch is
issued for the address and the store request (both physical address & data) is added
to the store queue. The store is committed in program order once the physical address
is ready.
Later, the store is issued to the cache subsystem after all previous stores
are completed. Note the current store can be issued only after it commits to ensure
precise exceptions. The store is completed after the data is written to the cache and
an acknowledgement is received.
5.2.3
Precise State
Maintaining precise state is required to recover from interrupts, exceptions or misspeculations. Exceptions are caused due to illegal instruction opcodes, divide-by-zero
operations, and memory faults (TLB misses, page faults, segmentation faults, etc.).
Mis-speculations are caused due to branch mispredictions and memory load ordering
violations.
The processor maintains two methods to recover precise state. The fast but costly
method involves taking snapshots of the Register Alias Table on every instruction that
could cause pipeline rollback. To recover precise state, the snapshot is simply restored.
This method is commonly used for branch and load ordering mis-speculations. The
slow but cheap method involves rolling back each architectural register's physical
register translation starting with the latest dispatched entry backwards to the entry
that initiated the pipeline rollback. This method is commonly used for exceptions
and interrupts.
5.3
Background: Memory Models Specification
In order to study which processor optimizations violate or preserve a particular memory consistency model, a formal specification of memory ordering rules is required.
In this section, two methods of specification are discussed, the operational and the
axiomatic specification. We will prove the correctness of our timestamp-based validation technique using the axiomatic specification approach.
5.3.1
Operational Specification
The operational specification presents an abstract memory model specification using hardware structures such as store buffers. This formal specification is close to an actual hardware implementation and serves as a guide for hardware designers.
Figure 5-1: Operational Specification of the TSO Memory Consistency Model (processor, store buffer, and memory components connected by request/response channels).
For example, the total store order (TSO) memory model can be specified using
three types of components that run concurrently. A processor component (Pr) produces a sequence of memory operations, a memory location component (M) keeps
track of the latest value written to a specific memory location, and a store buffer
component (WQ) implements a FIFO buffer for write messages and supports read
forwarding. These components are connected in the configuration described in Figure
5-1.
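A minimal software rendering of these components is sketched below, assuming one store buffer per core and ignoring fences; it is meant only to illustrate the FIFO write ordering and read forwarding of the WQ component, not to serve as a complete TSO reference model.

#include <cstdint>
#include <deque>
#include <unordered_map>
#include <utility>

// Sketch of the memory-location component (M): latest value per address.
struct Memory {
  std::unordered_map<uint64_t, uint64_t> data;
  void write(uint64_t addr, uint64_t val) { data[addr] = val; }
  uint64_t read(uint64_t addr) const {
    auto it = data.find(addr);
    return (it == data.end()) ? 0 : it->second;
  }
};

// Sketch of the store buffer component (WQ): a FIFO of write messages
// with read forwarding to loads from the same core.
struct StoreBuffer {
  std::deque<std::pair<uint64_t, uint64_t>> fifo;   // (address, value)

  void enqueue(uint64_t addr, uint64_t val) { fifo.emplace_back(addr, val); }

  // Drain the oldest buffered write to memory (FIFO order).
  void drainOne(Memory& m) {
    if (fifo.empty()) return;
    m.write(fifo.front().first, fifo.front().second);
    fifo.pop_front();
  }

  // A load first checks the youngest matching buffered store from its own
  // core (read forwarding) and falls back to memory otherwise.
  uint64_t load(uint64_t addr, const Memory& m) const {
    for (auto it = fifo.rbegin(); it != fifo.rend(); ++it)
      if (it->first == addr) return it->second;
    return m.read(addr);
  }
};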
5.3.2
Axiomatic Specification
The axiomatic specification is programmer-centric and does not contain specifications
about store buffers. This specification lists a set of axioms that define which execution
traces are allowed by the model and in particular, which writes can be observed by
each read.
An execution trace is a sequence of memory operations (Read, Write,
Fence) produced by a program. Each operation in the trace includes an identifier of
the thread that produced that operation, and the address and value of the operation
for reads and writes.
Axiomatic specifications usually refer to the program order, <p. For two operations x and y, x <p y if both x and y belong to the same thread and x precedes y in the execution trace. The program order, however, is not the order in which the memory operations are observed by the main memory. The memory order, <m, is
a total order that indicates the order in which memory operations affect the main
memory. A read observes the latest write to the same memory address according to
<m. However, a global memory order can be constructed only for those models that
have a "store atomicity" property, i.e., all threads observe writes in the same order.
Examples of store atomic memory models are sequential consistency (SC) and total
store order (TSO).
We only define memory models having the store atomicity property. The total store order (TSO) model can be defined using two axioms: the Read-Values axiom and the Ordering axiom. The read-values axiom requires the definition of a function sees(x, y, <): sees(x, y, <) is true if y is a write and (y < x or y <p x). The second part of this definition allows for load forwarding from a store (S1) to a later load (L2) even though the load could appear earlier in the memory order (<m).
Read Values. Given a read x and a write y to the same address as x, then x and y have the same value if sees(x, y, <m) and there is no other write z such that sees(x, z, <m) and y <m z. If, for a read x, there is no write y such that sees(x, y, <m), then the read value is 0.
Ordering. For every x and y, x <p y implies that x <m y, unless x is a write and y is a read.
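The two axioms can be restated compactly in the notation above; this is only a reformulation of the definitions just given, with val(.) used as shorthand for the value read or written.

\begin{align*}
\mathrm{sees}(x, y, <_m) \;\triangleq\;\; & y \text{ is a write} \;\wedge\; \big(y <_m x \;\vee\; y <_p x\big)\\
\textbf{Read Values:}\;\; & \text{if } x \text{ reads and } y \text{ writes the same address, } \mathrm{sees}(x, y, <_m), \text{ and}\\
& \text{no write } z \text{ to that address satisfies } \mathrm{sees}(x, z, <_m) \wedge y <_m z,\\
& \text{then } \mathit{val}(x) = \mathit{val}(y); \text{ if no such } y \text{ exists, then } \mathit{val}(x) = 0.\\
\textbf{Ordering (TSO):}\;\; & x <_p y \;\Rightarrow\; x <_m y, \text{ unless } x \text{ is a write and } y \text{ is a read.}
\end{align*}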
5.4
Timestamp-based Consistency Validation
This section introduces a novel timestamp-based technique for exploiting memory
level parallelism (MLP) in processors with remote access based coherence protocols.
This technique allows all load/store operations to be executed speculatively.
The
correctness of speculation is validated by associating timestamps with every memory
transaction and performing a simple arithmetic check at commit time. This mechanism is efficient due to the observation that consistency violations occur due to
conflicting accesses that have temporal proximity (i.e., are issued only a few cycles
apart). This enables the timestamps to be narrow and stored only for a small time
window.
We describe the working of this technique under the popular TSO memory model.
Later, in Section 5.5.1, we discuss how it can be extended to stronger (i.e., SC) or
weaker memory models.
The timestamp-based technique is built up gradually through a sequence of steps.
The microarchitecture changes needed are colored in Figure 5-2 and will be described
when introduced.
The implementation is first described for a pure remote access
scheme (that always accesses the shared L2 cache). Later, we describe adaptations
for the locality-aware protocol that combines both remote L2 and private L1 cache
accesses according to the reuse characteristics of data.
5.4.1
Simple Implementation of TSO Ordering
Figure 5-2: Microarchitecture of a multicore tile. The orange colored modules are added to support the proposed modifications. (ROB: Reorder Buffer; SQ: Store Queue; LQ: Load Queue; TC: Timestamp Counter; TC-H: Timestamp - HRP Counter; DRQ: Directory Request Queue; L1LHQ/L1SHQ: L1 Cache Load/Store History Queue; L2LHQ/L2SHQ: L2 Cache Load/Store History Queue.)

We first introduce a straightforward method to ensure TSO ordering (without speculation). The TSO model can be implemented by enforcing the following two constraints: (1) Load operations wait (i.e., without being issued to the cache subsystem) till all previous loads and fences complete; (2) Store operations and fences wait till
all previous loads, stores and fences complete. Note that the pipeline can continue executing as long as there are no unsatisfied dependencies (i.e., no data hazard) and there are free entries in the load and store queues (no structural hazard). If all memory references are satisfied by the L1 cache, this scheme works quite well. However, a memory transaction that misses within the L1 cache (as is the case for remote transactions) takes several (~10-100) cycles to complete. During this time, the load and store buffers fill up quickly and stall the pipeline. For a processor with a fetch width of 2 instructions per cycle, and a memory reference every 3 instructions, a memory operation has to complete within ~1.5 cycles to match the throughput of the pipeline.
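The ~1.5-cycle figure follows directly from the stated assumptions:

\[
2\ \tfrac{\text{instructions}}{\text{cycle}} \times \tfrac{1}{3}\ \tfrac{\text{memory operations}}{\text{instruction}}
= \tfrac{2}{3}\ \tfrac{\text{memory operations}}{\text{cycle}}
\;\Longrightarrow\; \text{one memory operation every } \tfrac{3}{2} = 1.5 \text{ cycles.}
\]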
5.4.2
Basic Timestamp Algorithm
Next, we formulate a mechanism to increase the performance of load operations (instead of making them wait as described above).
This requires enabling loads to
execute (speculatively) as soon as their address operand is ready, potentially out of program order and in parallel with other loads. This could violate the TSO memory consistency model, and hence, a validation step is required to ensure that memory ordering is preserved. Stores and fences, on the other hand, have to wait till all previous loads, stores and fences have been completed. To check whether the speculative execution of loads violates TSO ordering, a timestamp-based validation technique is proposed.
Timestamp Generation:
Timestamps are generated using a per-core counter, the Timestamp Counter (TC),
as shown in Figure 5-2. TC increments on every clock cycle. Assume for now that
timestamps are of infinite width, i.e., they never rollover and all cores are in a single
clock domain.
We will remove the infinite width assumption in Section 5.4.4 and
discuss the single clock domain assumption in Section 5.5.2. Timestamps are tracked
at different points of time, e.g., during load issue, store completion, etc. and comparisons are performed on these timestamps to determine whether speculation has failed
according to the algorithms discussed in the following subsections.
Microarchitecture Modifications:
The following changes are made to the L2 cache, the load queue and the store queue
to facilitate the tracking of timestamps.
L2 Cache: The shared L2 cache is augmented with an L2 Load History Queue
(L2LHQ) and an L2 Store History Queue (L2SHQ) as shown in Figure 5-2. They
track the times at which loads and stores have been performed at the L2 cache. The
timestamp assigned to a load/store is the time at which the request arrives at the
L2 cache.
(Once a request arrives, all subsequent requests to the same cache line
will be automatically ordered after it). Each entry in the L2LHQ/L2SHQ has two attributes, Address and Timestamp. An entry is added to the L2LHQ or L2SHQ whenever a remote load or store arrives. Assume for now that the L2LHQ/L2SHQ are of infinite size. We will remove this requirement in Section 5.4.3.
Figure 5-3: Structure of a load queue entry.

Load Queue: Each load queue entry is augmented with the attributes shown in Figure 5-3. Only the Address, IssueTime and LastStoreTime fields are required for this
algorithm. The IssueTime field records the time at which the load was issued to the
cache subsystem and the LastStoreTime is the most recent modification time of the
cache line. A remote load obtains this timestamp from the L2SHQ (L2 Store History
Queue). If there are multiple entries in the L2SHQ corresponding to the address, the
most recent entry's timestamp is taken. If no entries are found corresponding to the
address, there has not been a store so far and hence, the most recent timestamp is
'0'. This is then relayed back to the core over the on-chip network along with the
word that is read. The load queue also contains a per-core OrderingTime field which
is used for detecting speculation violations as will be described later.
Figure 5-4: Structure of a store queue entry.
Store Queue: Each store queue entry is augmented with the attributes shown in
Figure 5-4. The Data contains the word to be written.
The LastAccessTime field
records the most recent access time of the cache line at the L2 cache. A remote store
obtains the most recent timestamps from the L2LHQ and L2SHQ respectively. The
maximum of the two timestamps is computed to get the LastAccessTime. This is then
communicated back to the core over the on-chip network along with the acknowledgement for the store. The store queue also contains a per-core StoreOrderingTime field
which is used for detecting speculation violations.
Speculation Violation Detection:
In order to check whether speculatively executed loads have succeeded in the TSO
memory model, the IssueTime of a load must be greater than or equal to:
1. The LastStoreTime observed by previous load operations. This ensures that the current (speculatively executed) load can be placed after all previous load operations in the global memory order. In other words, the global memory order for any pair of load operations from the same thread respects program order, i.e., the Load→Load ordering requirement of TSO is met.
2. The LastAccessTime observed by previous store operations that are separated from the current load by a memory fence (MFENCE in x86). This ensures that the load can be placed after previous fences in the global memory order, i.e., the Fence→Load ordering requirement is met.
Algorithm 1 : Basic Timestamp Scheme
1: function COMMITLOAD()
2:     if IssueTime < OrderingTime then
3:         REPLAYINSTRUCTIONSFROMCURRLOAD()
4:         return
5:     if OrderingTime < LastStoreTime then
6:         OrderingTime ← LastStoreTime
7: function RETIRESTORE()
8:     if StoreOrderingTime < LastAccessTime then
9:         StoreOrderingTime ← LastAccessTime
10: function COMMITFENCE()
11:     if OrderingTime < StoreOrderingTime then
12:         OrderingTime ← StoreOrderingTime
The OrderingTime field in the load queue keeps track of the maximum of (1) the
LastStoreTime observed by previously committed loads and (2) the LastAccessTime
observed by previously retired stores that are separated from the current operation by
a fence. This provides a convenient field to compare the issue time with when making
speculation checks. The OrderingTime is updated and used by the functions presented
in Algorithm 1. The COMMITLOAD function is executed when a load is ready to be
committed. This algorithm checks whether speculation has failed by comparing the
IssueTime of a load against OrderingTime. If the IssueTime is greater, speculation
has succeeded and the OrderingTime is updated using the LastStoreTime observed
by the current load. Else, speculation has failed, and the instructions are replayed
starting from the current load.
The RETIRESTORE function is executed when a store is retired from the store
queue (i.e., after it completes execution). Note that the store could have been committed (possibly much earlier) as soon as its address calculation and translation are
done. The StoreOrderingTime field in the store queue keeps track of the maximum of
the LastAccessTime observed by previously retired stores and is updated when each
store is retired.
The COMMITFENCE function is executed when a fence is ready to be committed.
This function updates the OrderingTime using the StoreOrderingTime field and serves
to maintain the Fence→Load ordering.
On the other hand, if the IssueTime is less than the OrderingTime, a previous
load operation could have seen a later store, possibly violating consistency. In such
a scenario, the pipeline has to be rolled back and restarted with this offending load
instruction.
The above algorithms suffice for implementing the TSO memory model. If sequential consistency (SC) is to be implemented, all store operations are marked as being
accompanied by an implicit memory fence. Hence, the IssueTime of a speculatively
executed load must be greater than or equal to the maximum of the LastStoreTime
and LastAccessTime observed by previous load and store operations, respectively.
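A compact way to view these commit-time checks (Algorithm 1 plus its SC variant) is the following sketch. It assumes the unbounded timestamps of this section, folds the OrderingTime bookkeeping into one structure, and uses illustrative names rather than actual hardware signals.

#include <algorithm>
#include <cstdint>

// Per-core ordering state (sketch; unbounded timestamps assumed here).
struct OrderingState {
  uint64_t orderingTime      = 0;  // max LastStoreTime seen by committed loads
                                   // (and LastAccessTime carried across fences)
  uint64_t storeOrderingTime = 0;  // max LastAccessTime seen by retired stores
};

// Returns true if a speculatively executed load may commit under TSO;
// false means the pipeline must be rolled back and replayed from this load.
bool tryCommitLoad(OrderingState& s, uint64_t issueTime, uint64_t lastStoreTime) {
  if (issueTime < s.orderingTime) return false;          // violation detected
  s.orderingTime = std::max(s.orderingTime, lastStoreTime);
  return true;
}

void retireStore(OrderingState& s, uint64_t lastAccessTime) {
  s.storeOrderingTime = std::max(s.storeOrderingTime, lastAccessTime);
}

void commitFence(OrderingState& s) {
  s.orderingTime = std::max(s.orderingTime, s.storeOrderingTime);
}

// Under SC, every store behaves as if followed by an implicit fence, so
// commitFence() is applied after each store retires; the load check above
// then also covers the LastAccessTime observed by previous stores.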
Proof of Correctness
Here, a formal proof is presented on why the above timestamp-based validation preserves TSO memory ordering.
The proof works by constructing a global order of
memory operations that satisfies the Ordering axiom and the Read-Values axiom described earlier in Section 5.3. The global order is constructed by post-processing the
memory trace obtained after executing the program.
The memory trace consists of a 2-dimensional array of memory operations. The
1"
dimension denotes a thread using its ID and the
2 ,d
dimension lists the memory
operations executed by that particular thread. Each memory operation (op) has 3
attributes:
1. Type: Denotes whether the operation is a Load, Store or Fence.
2. LastStoreTime: Wall-clock time at which the last modification was done to the
cache line.
3. LastAccessTime: Wall-clock time at which the last access was made to the cache
line.
Both LastStoreTime and LastAccessTime are used by the timestamp check in
Algorithm 1 presented previously. The post-processing of the memory trace is done
using Algorithm 2.
Algorithm 2 : Global Memory Order (<m) Construction - Basic Timestamp
1: function CONSTRUCTGLOBALMEMORYORDER(NumThreads, MemoryTrace)
2:     <m ← {}                                  ▷ Initialize the global memory order
3:     for t ← 1 to NumThreads do               ▷ Iterate over all program threads
4:         OrderingTime ← 0
5:         StoreOrderingTime ← 0
6:         for each op ∈ MemoryTrace[t] do
7:             if op.Type = LOAD then
8:                 OrderingTime ← max(OrderingTime, op.LastStoreTime) + δ
9:                 <m[op] ← OrderingTime
10:            else if op.Type = STORE then
11:                StoreOrderingTime ← max(StoreOrderingTime, op.LastAccessTime) + δ
12:                OrderingTime ← OrderingTime + δ
13:                <m[op] ← max(StoreOrderingTime, OrderingTime)
14:            else if op.Type = FENCE then
15:                OrderingTime ← max(OrderingTime, StoreOrderingTime) + δ
16:                <m[op] ← OrderingTime
    return <m
In the outer loop, the algorithm iterates through all the program threads. In the
inner loop, the memory operations (LOAD,STORE & FENCE) of each thread are iterated through in program order. The algorithm assigns a timestamp to each memory
operation and inserts the operation into the global memory order <m in the ascending order of timestamps. The '<m[op] ←' statement ensures that each operation is inserted in ascending order.
Here, δ is an infinitesimally small quantity that is added to ensure that a unique timestamp is assigned to all memory operations from the same thread. For all programs considered, the result obtained by summing δ over all memory operations is less than 1. When inserting in the global order <m, if multiple operations from
different program threads are assigned the same timestamp, then the placement is
done in the ascending order of thread IDs (multiple operations from the same thread
cannot have the same timestamp).
OrderingTime and StoreOrderingTime are updated based on op.LastStoreTime
and op.LastAccessTime similar to the logic in Algorithm 1 presented earlier.
Ordering Axiom:
The ordering axiom is satisfied by the above assignment of timestamps due to the following reasons. If the timestamp assigned to a load or fence operation is T, then the timestamps assigned to all subsequent load, store and fence operations (from the same program thread) are at least T + δ. This ensures that the Load→op and Fence→op ordering requirements are met. (Here, op could refer to a Load, Store or Fence.) If the timestamp assigned to a store operation is T, then the timestamps assigned to all subsequent store and fence operations are at least T + δ. This ensures that the Store→Store and Store→Fence ordering requirements are met.
Read-Values Axiom:
The read-values axiom is satisfied because the above assignment of timestamps preserves the order of load & store operations to the same cache line. For example, let the time at which a load arrives at the L2 cache be denoted as LoadTime. LastStoreTime indicates the time at which the last store to the same cache line
arrived at the L2 cache. By definition, LoadTime > LastStoreTime. Since speculation has succeeded, the timestamp check performed at commit time ensures that
LoadTime > OrderingTime. (Note that the speculation check in Algorithm 1 actually ensures that IssueTime > OrderingTime, but since LoadTime > IssueTime, it
follows that LoadTime > OrderingTime).
Now, the timestamp assigned by Algorithm 2 to the load operation described
earlier is max(OrderingTime, LastStoreTime) + δ. This ensures two invariants: (1)
the load timestamp is strictly after the last observed store operation to the cache line,
and (2) the load timestamp is strictly before the time the load operation arrives at
the cache.
Similarly, the timestamp assigned to a store operation ensures two invariants: (1)
the store timestamp is strictly after the last observed access to the cache line, and
(2) the store timestamp is strictly before the time the store operation arrives at the
cache.
The above invariants ensure that the global memory order preserves the actual
order of load & store operations to the same cache line, thereby satisfying the read-values axiom.
5.4.3
Finite History Queues
One main drawback of the algorithm discussed earlier was that the size of the load/store
history queues could not be bounded (i.e., they could grow arbitrarily large). The
objective of this section is to bound the size of the history queues. Note that the
timestamps could still grow arbitrarily large (we will remove this requirement in Section
5.4.4).
The history queues can be bounded by the observation that the processor cores
only care about load/store request timestamps within the scope of their reorder buffer
(ROB). For example, if the oldest micro-op in the reorder buffer has been dispatched
at 1000 cycles, the processor core does not care about load/store history earlier than
1000 cycles. This is because memory requests from other cores carried out before this
time will not cause consistency violations. Hence, history needs to be retained only
for a limited interval of time called the History Retention Period (HRP). After the
retention period expires, the corresponding entry can be removed from the history
queues. How long should the history retention period be? Intuitively, if the retention
period is equal to the maximum lifetime of a load or store operation starting from
dispatch till completion, no usable history will be lost. However, the maximum lifetime of a memory operation cannot be bounded in a modern multicore system due
to non-deterministic queueing delays in the on-chip network and memory controller.
An alternative is to set HRP such that nearly all (~
99%) memory operations
complete within that period. If operations do not complete within HRP, then spurious violations might occur. As long as these violations only affect overall system
performance and energy by a negligible amount, they can be tolerated.
Increas-
ing the value of HRP reduces spurious violations but requires large history queues
(L2LHQ/L2SHQ) while decreasing the value of HRP has the reverse effect (this will
be explained below). Finding the optimal value of HRP is critical to ensuring good
performance and energy efficiency.
Speculation Violation Detection:
Speculation violations are detected using the same algorithms in Section 5.4.2. However, since entries can be removed from the load/store history queues, checking the
history queues may not yield the latest load and store timestamps. Hence, an informed assumption regarding previous history has to be made. If no load or store
history is observed for a memory request, it can be safely assumed that the request did
not observe any load/store operations at the L2 cache after 'CompletionTime - HRP'.
Here, CompletionTime is the timestamp when the load/store request completes (i.e.,
the load value is returned or the store is acknowledged).
To prove the above statement for a load, consider where the LoadTime (i.e., the time when the data is read from the L2 cache) lies in relation to CompletionTime - HRP. If LoadTime < CompletionTime - HRP, then the statement is trivially true.
If LoadTime > CompletionTime - HRP, then the load would only observe the results
of a store operation carried out between CompletionTime - HRP and LoadTime.
However, if this is the case, then the store operation timestamp would be still visible
in the store history (since the retention period has not expired). A similar logic holds
for the load history as well. Hence, if no history is observed, the LastStoreTime and
LastAccessTime required by the algorithms in Section 5.4.2 can be adjusted as shown
by the ADJUSTHISTORY function in Algorithm 3.
Note that 'NONE' indicates no
load/store history for a particular address.
Algorithm 3 : Finite History Queues
1: function ADJUSTHISTORY()
2:     if LastLoadTime = NONE then
3:         LastLoadTime ← CompletionTime - HRP
4:     if LastStoreTime = NONE then
5:         LastStoreTime ← CompletionTime - HRP
6:     LastAccessTime ← MAX(LastLoadTime, LastStoreTime)
Finite Queue Management:
Adding entries to the history queue and searching for an address work similarly to the description in Section 5.4.2. However, with a finite number of entries, two extra
considerations need to be made.
1. History queue overflow needs to be considered and accommodated.
2. Queue entries need to be pruned after the history retention period (HRP) expires.
The finite history queue is managed using a set-associative structure that is indexed based on the address (just like a regular set-associative cache).
Queue Insertion: When an entry (<Address, Timestamp>) needs to be added to
the queue, the set corresponding to the address is first read into a temporary register.
A pruning algorithm is applied on this register to remove entries that have expired
(this will be explained later). Then, the <Address,Timestamp> pair is added to the
set as follows.
If the address is already present, then the maximum of the already
present timestamp and the newly added timestamp is computed and written. If the
address is not present, then the algorithm checks whether an empty entry is present.
If yes, then the new timestamp and address are written to the empty entry. Else, the
oldest timestamp in the set is retrieved and evicted. The new <Address,Timestamp>
pair is written in its place. In addition to the set-associative structure, each queue
also contains a Conservative Timestamp (ConsTime) field. The ConsTime field is
used to hold timestamps that have overflowed till they expire. The ConsTime field is
updated with the evicted timestamp.
Pruning Queue: A pruning algorithm removes entries from the queue after their
retention period (HRP) has expired and/or resets the ConsTime field. This function
uses a second counter called TC-H (shown in Figure 5-2). This counter lags behind the
timestamp counter (TC) by HRP cycles. If a particular timestamp is less than TC-H,
the timestamp has expired and can be removed from the queue. On every processor
cycle, the TC-H value is also compared to ConsTime. If equal, the ConsTime has expired and can be reset to 'NONE'.
Searching Queue: When searching the queue for an address, the set corresponding
to the address is first read into a temporary register. If there is an address match,
then the timestamp is returned. Else, the ConsTime field is returned. To maintain
maximum efficiency, the ConsTime field should be 'NONE' (expired) most of the time,
so that spurious load/store times are not returned.
All the above functions require the entries in a set to be searched in parallel to
be performance efficient.
However, since these history queues are really small (<
0.8KB), the above computations can be performed within the cache access time. We
observe experimentally that setting the associativity of the history queues to the
same associativity as the cache keeps the ConsTime field expired for nearly all of the
execution.
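The insertion, pruning and search operations described above can be summarized by the following software sketch of one history queue. The set indexing, container layout and the NONE encoding are modeling assumptions made for illustration; this is not RTL, and it uses wide timestamps (Section 5.4.4 narrows them).

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Software sketch of a finite load/store history queue (e.g. the L2SHQ).
class HistoryQueue {
 public:
  static constexpr uint64_t kNone = ~0ull;           // 'NONE' / expired marker

  HistoryQueue(std::size_t numSets, std::size_t ways, uint64_t hrp)
      : hrp_(hrp), sets_(numSets, std::vector<Entry>(ways)) {}

  // Record a load/store performed on 'lineAddr' at time 'now'.
  void insert(uint64_t lineAddr, uint64_t now) {
    std::vector<Entry>& set = sets_[lineAddr % sets_.size()];
    prune(set, now);
    Entry* victim = nullptr;
    for (Entry& e : set) {
      if (e.valid && e.addr == lineAddr) { e.ts = std::max(e.ts, now); return; }
      if (!e.valid && victim == nullptr) victim = &e;
    }
    if (victim == nullptr) {                          // set full: evict oldest
      victim = &set[0];
      for (Entry& e : set)
        if (e.ts < victim->ts) victim = &e;
      // Remember the evicted (overflowed) timestamp in ConsTime.
      consTime_ = (consTime_ == kNone) ? victim->ts
                                       : std::max(consTime_, victim->ts);
    }
    Entry fresh = {lineAddr, now, true};
    *victim = fresh;
  }

  // Return the last recorded timestamp for 'lineAddr', the conservative
  // timestamp if the entry overflowed, or kNone if all history has expired.
  uint64_t lookup(uint64_t lineAddr, uint64_t now) {
    std::vector<Entry>& set = sets_[lineAddr % sets_.size()];
    prune(set, now);
    for (const Entry& e : set)
      if (e.valid && e.addr == lineAddr) return e.ts;
    return consTime_;
  }

 private:
  struct Entry { uint64_t addr; uint64_t ts; bool valid; };

  // Drop entries whose retention period has passed; TC-H lags TC by HRP.
  void prune(std::vector<Entry>& set, uint64_t now) {
    const uint64_t tcH = (now >= hrp_) ? now - hrp_ : 0;
    for (Entry& e : set)
      if (e.valid && e.ts < tcH) e.valid = false;
    if (consTime_ != kNone && consTime_ <= tcH) consTime_ = kNone;
  }

  uint64_t hrp_;
  uint64_t consTime_ = kNone;
  std::vector<std::vector<Entry>> sets_;
};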
5.4.4
In-Flight Transaction Timestamps
One of the main drawbacks of the algorithm presented in Section 5.4.2 is that the
timestamps should be of infinite width, i.e., they are never allowed to rollover during
the operation of the processor. This drawback can be removed by the observation
that only the timestamps of memory operations present in the reorder buffer (ROB)
need to be compared when detecting consistency violations, i.e., only memory transactions that have temporal proximity could create violations. Hence, Load and Store
timestamps need to be distinct and comparable only for recent memory operations.
Finite Timestamp Width (TW): This observation can be exploited to use a finite
timestamp width (TW). When the timestamp counter reaches its maximum value,
it rolls over to '0' in the next cycle. The possible values that can be taken by the
counter can be divided into two quantums, an 'even' and an 'odd' quantum. During
the even quantum, the MSB of the timestamp counter is '0' while during the odd
quantum, the MSB is '1'. For example, if the timestamp width (TW) is 3 bits, values
0-3 belong to the even quantum while 4-7 belong to the odd quantum.
Now, to
check which timestamp is greater than the other, it needs to be known whether the
current quantum (i.e., the MSB of the TC [Timestamp Counter]) is an even or an odd quantum.
If the current quantum is even, then any timestamp with an even
quantum is automatically greater than a timestamp with an odd quantum and vice
versa.
If two timestamps of the same quantum are compared, a simple arithmetic
check suffices to know which is greater.
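The quantum-based comparison can be written as a small helper function. TW = 16 matches the width chosen later in this section, and the function assumes, as required above, that both timestamps were generated within the current or the previous quantum.

#include <cstdint>

constexpr unsigned kTW  = 16;                 // timestamp width in bits
constexpr uint32_t kMSB = 1u << (kTW - 1);    // mask selecting the quantum bit

// Returns true if timestamp 'a' is greater than or equal to timestamp 'b',
// given whether the current Timestamp Counter quantum is odd (MSB of TC set).
bool timestampGE(uint32_t a, uint32_t b, bool currentQuantumOdd) {
  const bool aOdd = (a & kMSB) != 0;
  const bool bOdd = (b & kMSB) != 0;
  if (aOdd == bOdd) return a >= b;            // same quantum: plain comparison
  // Different quantums: the timestamp from the current quantum is the more
  // recent one, and hence the greater of the two.
  return aOdd == currentQuantumOdd;
}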
Recency of Timestamps: A possible problem with the above comparison is when
the timestamps that are compared are not recent. For example, consider a system with
a timestamp width (TW) of 3 bits. Assume TC is set to 3 to start with. A timestamp, TA, is now generated and set to the value of TC, i.e., 3. Then, TC increments, reaches its maximum value of 7 and rolls over to 1. Now, another timestamp, TB, is set to 1. If the check TA > TB is now performed, the result is true according to the algorithm discussed above. But TA was generated before TB, so the result should have been false. The comparison check returned the wrong answer because TA was 'too old' to be
useful. Timestamps have to be 'recent' in order to return an accurate answer during
comparisons. Given a particular value of the timestamp counter (TC), timestamps
have to be generated in the current quantum or the previous quantum to be useful
for comparison. In the worst case, a timestamp should have been generated at most
2^(TW-1)
cycles before the current value of TC to be useful.
Consistency Check: In the algorithms described previously, the only arithmetic
check done using timestamps is at the commit point of load operations. The check
performed is: IssueTime > OrderingTime. If both IssueTime and OrderingTime are
recent, the check always returns a correct answer, else it might return an incorrect
answer. Now, if IssueTime is recent and OrderingTime is old, the 'correct' answer
for the consistency check is true, however, it might return false in certain cases. The
answer being 'false' is OK, since all it triggers is a false positive, i.e., it triggers a
consistency violation while in reality, there is no violation. As long as the number of
false positives is kept low, the system functions efficiently. So, the important thing is
to keep IssueTime recent.
This is accomplished by adding another bit to each reorder buffer (ROB) entry
to track the MSB of its DispatchTime (i.e., the time at which the micro-op was
dispatched to the ROB). So, each ROB entry tracks if it was dispatched during the
even or the odd quantum. If the DispatchTime is kept 'recent', the IssueTime of a
load operation will also be recent, since issue is after dispatch.
The DispatchTime
is kept recent by monitoring the entry at the head of the ROB and the timestamp
counter (TC). If the TC rolls over from the odd to the even quantum with the head of
the ROB pointing to an entry dispatched during the even quantum, then that entry's
timestamp is considered 'old'. A speculation violation is triggered and instructions
are replayed starting with the one at the head of the ROB. Likewise, if the TC rolls
over from the even to the odd quantum with the head pointing to an odd quantum
entry, a consistency violation is triggered. Through experimental observations, the
timestamp width (TW) is set to 16 bits. This keeps the storage overhead manageable
while creating almost no false positives. With TW = 16, each entry in the ROB has
2^(TW-1) = 32768 cycles to commit before a consistency violation is triggered due to
rollover.
5.4.5
Mixing Remote Accesses and Private Caching
The previous sections described the implementation of TSO on a pure remote access
scheme. The locality-aware protocol chooses either remote access at the shared L2
cache or private caching at the L1 cache based on the spatio-temporal locality of
data. Hence, the timestamp-based consistency validation scheme should be adapted
to such a protocol.
L1 Cache History Queues (L1LHQ/L1SHQ):
Such an adaptation requires information about loads and stores made to the private L1-D cache to be maintained for future reference in order to perform consistency validation. This information needs to be captured because private L1-D cache
loads/stores can execute out-of-order and interact with either remote or private cache
accesses such that the TSO memory consistency model is violated. Similar to the history queues at the L2 cache, the L1 Load History Queue (L1LHQ) and the L1 Store History Queue (L1SHQ) are added at the L1-D cache (shown in Figure 5-2) and capture the load and store history respectively. The history retention period (HRP) dictates how long the history is retained for. The management of the L1LHQ/L1SHQ (i.e., adding/pruning/searching) is carried out in the exact same manner as the L2
history queues.
With history queues at multiple levels of the cache hierarchy, it is important to
keep them synchronized. An invalidation, downgrade, or eviction request at the L1-D
cache causes the last load/store timestamps (if found) to be sent back along with the
acknowledgement so that they can be preserved at the shared L2 cache history queues
until the history retention period (HRP) expires. From the L2LHQ/L2SHQ, these
load/store timestamps could be passed on to cores that remotely access or privately cache data. A cache line fetch from the shared L2 cache into the L1-D cache copies
the load/store history for the address into the L1LHQ/L1SHQ as well. This enables
the detection of consistency violations using the same timestamp-based validation
technique described earlier.
Entries are added to and retrieved from the L1 history queues using the same mechanism as described for the L2 history queues. Pruning works in the same way as well. If the L1 history queues overflow, the mechanism described earlier is used to
retrieve a conservative timestamp.
Exclusive Store Prefetch:
Since exclusive store prefetch requests can be used to improve the performance of
stores that are cached at the L1-D cache, they must be leveraged by the locality-aware protocol also.
In fact, these prefetch requests can be leveraged by remote
stores to prefetch cache lines from off-chip DRAM into the L2 cache. This can be
accomplished only if both private and remote stores are executed in two phases.
The 1st phase (exclusive store prefetch) is executed in parallel and potentially
out of program order as soon as the store address is ready. If the cache line is already
present at the L2 cache, then the 1st phase for remote stores is effectively a NOP but
must be executed nevertheless since the information about whether a store is handled
remotely or cached privately is only present at the directory (that is co-located with
the shared L2 cache).
The 2nd phase (actual store) is executed in order, i.e., the
store is issued only after all previous stores have completed (to ensure TSO ordering)
and the 1st phase of the current store has been acknowledged. The 2nd phase for stores to the private L1-D cache completes quickly (~1-2 cycles), while remote stores have to execute a round-trip network traversal to the remote L2 cache before being completed.
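The two-phase discipline can be condensed into the following illustrative gating conditions; the structure and function names are placeholders, and cache hits, retries and network details are omitted.

#include <cstdint>

// Illustrative per-store state for the two-phase store handling.
struct StoreState {
  bool addressReady = false;  // address computed and translated
  bool committed    = false;  // retired from the ROB in program order
  bool phase1Acked  = false;  // exclusive prefetch / pending-store notice acked
};

// Phase 1 (exclusive prefetch, or a pending-store notification for a remote
// line) may issue out of program order as soon as the address is known.
bool canIssuePhase1(const StoreState& s) { return s.addressReady; }

// Phase 2 (the actual write) is issued in program order: only after the store
// has committed, its phase 1 has been acknowledged, and all previous stores
// have completed. (Section 5.4.6 relaxes the last condition for remote stores.)
bool canIssuePhase2(const StoreState& s, bool allPreviousStoresCompleted) {
  return s.committed && s.phase1Acked && allPreviousStoresCompleted;
}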
5.4.6
Parallel Remote Stores
This section deals with improving the performance of remote store requests. Serializing remote store requests could impact the processor performance in two ways:
1. Increase the store queue stalls due to capacity limitations.
2. Increase the reorder buffer (ROB) stalls due to store queue drains when fences
are present.
If an application has a significant stream of remote stores, the store queue gets
filled up quickly and stalls the pipeline. The pipeline can now execute only at the
throughput of remote store operations. To match the maximum throughput of a processor pipeline with a fetch width of 2 instructions, store operations have to execute
at a throughput of one every 4.5 cycles (assuming 1 memory reference every 3 instructions and
1 store every 3 memory accesses). This is not possible since a round-trip
network request takes ~24 cycles (assuming an 8x8 mesh and a 2 cycle per-hop
latency) without factoring in the cache access cost or network contention. Moreover,
if the application contains fences, the fence commit has to stall till the store buffer
has been drained which takes a higher amount of time if remote store operations are
present. This increases the occupancy of the fence in the ROB which increases the
probability of ROB stalls.
In this section, a mechanism is developed that allows remote store operations to be
issued in-order, executed in parallel and completed out-of-order with other memory
operations. The mechanism is based on the idea that a processor that issues store
operations to memory in program order is indistinguishable from one that executes store operations in parallel, as long as those stores are observed by a thread running on a different core
in program order. This is accomplished by leveraging the two-phase mechanism used
when executing remote store operations. The 1st phase, in addition to performing an
exclusive prefetch, informs the L2 cache that a remote store is pending. The remote
cache captures this information in an additional storage queue called the Pending
Store History Queue (PSHQ) (shown in Figure 5-2). The PSHQ captures the time
at which the pending store history arrived at the remote L2 cache. (Note that for a
private L1-D cache store, the PSHQ is not modified). The purpose of the PSHQ is to
inform future readers that a remote store is about to happen in the near future. This
information is used in the speculation validation process as will be described later.
The 2nd phase, when performing the actual store operation (i.e., writing the
updated value to the remote cache), removes the corresponding entry from the PSHQ.
The 2nd phase of a private/remote store can be issued only after all previous micro-ops have committed (to ensure precise interrupts), all previous stores to the private
L1-D cache have completed and the 1st phases of all previous remote stores have been
acknowledged.
Waiting for all previous private L1-D cache stores to be completed
implies that the StoreA→StoreB ordering is trivially preserved if StoreA is intended
for the private L1-D cache.
When an L2 cache load request is made, the cache controller searches the PSHQ
in addition to the L2LHQ/L2SHQ and returns a LastPendingStoreTime (if found).
In case multiple entries are found, the minimum timestamp (i.e., the oldest entry)
is returned to the requester.
The COMMITLOAD function in Algorithm 4 is used
to validate (at the commit time of a load) that the Load-+Load and Store-+Store
ordering requirements have been met. At the commit time of a load, the minimum of
the LastPendingStoreTime and the IssueTime of the load are compared to OrderingTime. If greater, it has two implications: (1) the load has been issued after all stores
observed by preceding loads have been written to the shared L2 cache, and (2) any
pending store occurs after all the stores observed by the current thread, in program
order. However, if the LastPendingStoreTime is lesser than the OrderingTime, it is
possible that the pending store occurs before a store observed by the current thread
in program order and hence, memory consistency could be violated.
The pending
store history serves as an instrument for enabling future stores to be executed before
the 2nd phase of the current store has been completed.
Algorithm 4 : Parallel Remote Stores
1: function COMMITLOAD()
2:     if MIN(IssueTime, LastPendingStoreTime) < OrderingTime then
3:         REPLAYINSTRUCTIONSFROMCURRLOAD()
4:         return
5:     if OrderingTime < LastStoreTime then
6:         OrderingTime ← LastStoreTime
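A software sketch of the PSHQ bookkeeping that Algorithm 4 relies on is given below; the container choice, the overflow signaling and the sentinel value are simplifications made for illustration only.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the Pending Store History Queue (PSHQ) kept at each L2 slice.
struct PendingStore {
  uint64_t lineAddr;
  uint64_t arrivalTime;   // time the phase-1 notification reached this slice
  int      requesterCore;
};

class PSHQ {
 public:
  explicit PSHQ(std::size_t capacity) : capacity_(capacity) {}

  // Phase 1 of a remote store: record the pending store. Returns false on
  // overflow, in which case the requester is notified and serializes its
  // later stores behind the current one.
  bool addPending(uint64_t lineAddr, uint64_t now, int core) {
    if (entries_.size() >= capacity_) return false;
    entries_.push_back({lineAddr, now, core});
    return true;
  }

  // Phase 2 of the store (or a squash on mis-speculation): drop the entry.
  void removePending(uint64_t lineAddr, int core) {
    entries_.erase(
        std::remove_if(entries_.begin(), entries_.end(),
                       [&](const PendingStore& p) {
                         return p.lineAddr == lineAddr && p.requesterCore == core;
                       }),
        entries_.end());
  }

  // On a load to this slice: return the oldest pending-store time for the
  // line (the minimum timestamp), or UINT64_MAX if no store is pending so
  // that the MIN() in Algorithm 4 degenerates to the load's IssueTime.
  uint64_t lastPendingStoreTime(uint64_t lineAddr) const {
    uint64_t t = UINT64_MAX;
    for (const PendingStore& p : entries_)
      if (p.lineAddr == lineAddr) t = std::min(t, p.arrivalTime);
    return t;
  }

 private:
  std::size_t capacity_;
  std::vector<PendingStore> entries_;
};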
Queue Overflow: If the PSHQ is full, then an entry is not added during the 1st
phase. The requesting core is notified of this fact during the acknowledgement.
In
this scenario, the core delays executing the 2nd phase of all future stores until the 2nd
phase of the current remote store has been completed. This trivially ensures that the
current remote store is placed before later stores (both private/remote) in the global
memory order.
Speculation Failure: If speculation fails, then the PSHQ entries corresponding to
the remote stores on the wrong path have to be removed. This is accomplished using
a separate network message that removes the PSHQ entries before instructions are
replayed.
5.4.7
Overheads
The overheads of the timestamp-based technique that uses all the mechanisms described previously are stated below.
L1 Cache: The L1SHQ (L1 store history queue) is sized based on the expected throughput of store requests to the private L1-D cache and the History Retention Period (HRP). In Section 5.8.3, HRP is fixed to be 512 ns. A memory access is expected every 3 instructions, and a store is expected every 3 memory accesses, so for a single issue processor with a 1 GHz clock, a store is expected every 9 ns. Each SHQ entry contains the store timestamp and the physical cache line address (42 bits). The width of each timestamp is 16 bits (as discussed above). Hence, the size of the L1SHQ = (512/9) × (16 + 42) bits ≈ 0.4 KB. The throughput of loads is approximately twice that of stores, hence the size of the L1LHQ (L1 load history queue) is 0.8 KB.
Since the L1LHQ and L1SHQ are much smaller than the L1-D cache, they can
be accessed in parallel to the cache tags, and so, do not add any extra latency. The
energy expended when accessing these structures is modeled in our evaluation.
L2 Cache: The L2SHQ is sized based on the expected throughput of remote store
requests to the L2 cache and invalidations/write-backs from the L1-D cache. The
throughput of remote requests is much less than that of private L1-D cache requests,
but can be susceptible to higher contention if many remote requests are destined for
the same L2 cache slice. To be conservative, the expected throughput is set to one
store every 18 processor cycles (this is 4x the average expected throughput from
experiments).
The same calculation (listed above) is repeated to obtain a size of
0.2KB. The L2LHQ has twice the expected throughput as the L2SHQ, so its size is
0.4KB.
The PSHQ (Pending Store History Queue) is sized based on the expected throughput of remote stores as well. Each pending store is maintained till its corresponding 2nd phase store arrives. This latency is conservatively assumed to be 512 ns. Also, each entry in the PSHQ contains a requester core ID in addition to the address and timestamp. Hence, the size of the PSHQ = (512/18) × (16 + 42 + 6) bits ≈ 0.2 KB.
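The sizing arithmetic above can be restated compactly. This is a worked summary derived from the numbers given in the text (HRP = 512 ns, 16-bit timestamps, 42-bit line addresses, 6-bit core IDs, one store per 9 ns at the L1 and per 18 cycles at the L2), not an additional specification:

\begin{align*}
\text{L1SHQ} &= \tfrac{512}{9} \times (16 + 42)\ \text{bits} \approx 0.4\ \text{KB}, & \text{L1LHQ} &\approx 2 \times \text{L1SHQ} = 0.8\ \text{KB},\\
\text{L2SHQ} &= \tfrac{512}{18} \times (16 + 42)\ \text{bits} \approx 0.2\ \text{KB}, & \text{L2LHQ} &\approx 2 \times \text{L2SHQ} = 0.4\ \text{KB},\\
\text{PSHQ} &= \tfrac{512}{18} \times (16 + 42 + 6)\ \text{bits} \approx 0.2\ \text{KB}.
\end{align*}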
Since the L2LHQ and L2SHQ are much smaller than the L2 cache, they can
be accessed in parallel to the cache tags, and so do not add any extra latency. The
energy expended when accessing these structures is modeled in our evaluation.
Load/Store Queues & Reorder Buffer: Each load queue and store queue entry
is augmented with 3 and 1 timestamps respectively (as shown in Figures 5-3 & 5-4).
With 64 load queue entries, the overhead is 64 × 3 × 16 bits = 384 bytes. With 48 store queue entries, the overhead is 48 × 16 bits = 96 bytes. A single bit added to the
ROB for timestamp overflow detection only has a negligible overhead.
Network Traffic: Whenever an entry is found in the L1LHQ/L1SHQ on an invalidation/write-back request, or an entry is found in the L2LHQ/L2SHQ during a remote access or cache line fetch from the L2 cache, the corresponding timestamp is added to the acknowledgement message. Since each timestamp width is 16 bits and the network flit size is 64 bits (see Table 2.1), even if all three of the load, store & pending-store timestamps need to be accommodated, only 1 extra flit needs to be added to the acknowledgement.
The total storage overhead is ~ 2.5KB counting all the above changes.
5.4.8
Forward Progress & Starvation Freedom Guarantees
Forward progress for each core is guaranteed by the timestamp-based consistency validation protocol. To understand why, consider the two reasons why load speculation
could fail: (1) Consistency Check Violation, and (2) Timestamp Rollover.
If speculation fails due to the consistency check violation, then re-executing the
load is guaranteed to allow it to complete. This is because the IssueTime of the load
(when executed for the second time) will always be greater than the time at which
the consistency check was made, i.e., the commit time of the load, which in turn is
greater than OrderingTime. This is because OrderingTime is simply the maximum
of load and store timestamps observed by previous memory operations, and the time
at which the load is committed is trivially greater than this.
If speculation fails due to timestamp rollover, then re-executing the load/stores is
guaranteed to succeed because it cannot conflict with any previous operation. Since
forward progress is guaranteed for all the cores in the system, this technique of ensuring TSO ordering is starvation-free.
5.5 Discussion

5.5.1 Other Memory Models
Section 5.4 discussed how to implement the TSO memory ordering on the locality-aware coherence protocol.
The TSO model is the most popular, being employed
by x86 and SPARC processors.
Other memory models of interest are Sequential
Consistency (SC), Partial Store Order (PSO), and the IBM Power/ARM models.
We provide an overview of how they can be implemented with the timestamp-based
scheme.
Sequential Consistency (SC) can be implemented by associating an implicit fence after every store operation. Hence, in the RETIRESTORE function in Section 5.4.2, each store directly updates OrderingTime using its LastAccessTime. This ensures that the Store → Load program order is maintained in the global memory order.
Partial Store Order (PSO) relaxes the Store → Store ordering and only enforces it when a fence is present. This enables all stores, both private & remote, to be issued in parallel and potentially completed out-of-order. On a fence that enforces Store → Store ordering, stores after the fence can be issued only after the stores before it complete.
IBM Power is a more relaxed model that enforces minimal ordering between memory operations in the absence of fences. Here, we discuss how its two main fences,
lwsync and hwsync are implemented. The lwsync fence enforces TSO ordering and
can be implemented by maintaining a LoadOrderingTime field that keeps track of
the maximum LastStoreTime observed so far. On a fence, the LoadOrderingTime
is copied to the OrderingTime field and the timestamp checks outlined earlier are
run. The hwsync fence enforces SC ordering. This can be implemented by taking
the maximum of the LoadOrderingTime and StoreOrderingTime and updating the
OrderingTime field with this maximum. The ARM memory model is similar to the
IBM Power model and hence can be implemented in a similar way.
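As a rough illustration of how these models map onto the timestamp machinery, the sketch below shows one way OrderingTime could be updated for the different fence/store types. The field and function names are assumptions for illustration; only LoadOrderingTime and the use of LastAccessTime come from the text.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative per-core ordering state.
struct OrderingState {
    uint64_t ordering_time       = 0;  // used by the commit-time load check
    uint64_t load_ordering_time  = 0;  // max LastStoreTime observed by loads
    uint64_t store_ordering_time = 0;  // max store completion time observed
};

// TSO / SC: a retiring store's completion time feeds the ordering state.
void retire_store(OrderingState& st, uint64_t last_access_time, bool sc_mode) {
    st.store_ordering_time = std::max(st.store_ordering_time, last_access_time);
    if (sc_mode)  // SC: implicit fence after every store (Store->Load enforced)
        st.ordering_time = std::max(st.ordering_time, last_access_time);
}

// IBM Power lwsync: enforce TSO-like ordering at the fence.
void lwsync(OrderingState& st) {
    st.ordering_time = std::max(st.ordering_time, st.load_ordering_time);
}

// IBM Power hwsync: enforce SC ordering at the fence.
void hwsync(OrderingState& st) {
    st.ordering_time = std::max(st.ordering_time,
                                std::max(st.load_ordering_time,
                                         st.store_ordering_time));
}
```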
5.5.2
Multiple Clock Domains
The assumption in Section 5.4 was that there was only a single clock domain in the
system. However, current multicore processors are gravitating towards multiple clock
domains with independent dynamic voltage and frequency scaling (DVFS). All current processors
have a global clock generation circuit (PLL) (to the best of our knowledge). The global
clock is distributed using digital clock divider circuits to enable multiple cores to run
at different frequencies. However, clock boundaries are synchronous and predictable.
This is done because asynchronous boundaries within a chip have reliability concerns
and PLLs are power hungry circuits. If clock boundaries are synchronous, per-core
timestamps that are incremented on each core's local clock can be translated into
'global' timestamps using a multiplier that is inversely proportional to each core's
frequency.
The per-core timestamp counters, however, cannot be clock-gated since they must
always be translatable into a global timestamp using a multiplier. If a core changes
its frequency, the timestamp counter must be re-normalized to the new frequency and
the multiplier changed as well. Lastly, in the event that multiple clock generation
circuits are used, it is possible to bound the clock skew between cores and incorporate
this skew into our load speculation validator (basically, each store timestamp would
be incremented with this skew before comparison).
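A minimal sketch of the local-to-global timestamp translation, assuming a reference ("global") clock frequency known to hardware and exact divider ratios at the synchronous clock boundaries (all names below are hypothetical):

```cpp
#include <cstdint>

// Translate a per-core timestamp (counted in local clock cycles) into a global
// timestamp. The multiplier is inversely proportional to the core's frequency,
// as described in the text; synchronous clock dividers make the ratio exact.
uint64_t to_global_timestamp(uint64_t local_cycles,
                             uint64_t global_freq_mhz,
                             uint64_t core_freq_mhz) {
    return local_cycles * global_freq_mhz / core_freq_mhz;
}

// On a DVFS transition the counter is re-normalized so that already-elapsed
// time is preserved under the new multiplier.
uint64_t renormalize_counter(uint64_t local_cycles,
                             uint64_t old_freq_mhz,
                             uint64_t new_freq_mhz) {
    return local_cycles * new_freq_mhz / old_freq_mhz;
}
```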
5.6
Parallelizing Non-Conflicting Accesses
An alternate/complementary method to exploit memory level parallelism (MLP) while maintaining TSO is to recognize the fact that only conflicting accesses to shared read-write data can cause memory consistency violations. Concurrent reads to shared read-only data and accesses to private data cannot lead to violations [84]. Such memory accesses can be both issued and completed out-of-order. Only memory accesses
to shared read-write data must be ordered as described in Section 5.4.1.
5.6.1
Classification
In order to accomplish the required classification of data into private, shared read-only and shared read-write, a page-level classifier is built by augmenting existing TLB
and page table structures. Each page table entry is augmented with the attributes
shown in Figure 5-5.

• Private: Denotes if the page is private ('1') or shared ('0').

• Read-Only: Denotes if the page is read-only ('1') or read-write ('0').

• Last-Requester-ID: Tracks the last requester that accessed the particular page. Aids in classifying the page as private or shared.

Figure 5-5: Structure of a page table entry.
Let us describe how memory accesses are handled under this implementation.
Load Requests:
On a load request, the TLB is looked up to obtain attributes of a
page. If the memory request misses the TLB, the TLB miss handler is invoked. The
operation of this handler is described in Section 5.6.2. Once the TLB miss handler
populates the TLB with the page table entry of the requested address, the attributes
are looked up again. If the page is private or read-only, then the load request can be
immediately issued to the cache subsystem. But if the page is shared and read-write,
the load request waits in the load queue till all previous loads complete.
Store Requests:
On a store request, the TLB is looked up to obtain the attributes.
If it is a TLB miss, the miss handler is invoked again. Once the TLB is populated, the
attributes are looked up. If the page is read-only, the Page Protection Fault Handler
is invoked. The operation of this handler is described in Section 5.6.3.
When the
handler returns, the read-only attribute of the page should be marked as '0'. Now, if
the page is private, the store request can be issued as long as previous operations have
all committed (this is to maintain single-thread correctness). On the other hand, if
the page is shared, the store request waits in the store queue till all previous stores
have completed.
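The issue-gating rules above can be summarized by the following sketch. The TLB attribute structure and the queue-status flags are simplified assumptions; the decisions themselves follow the text.

```cpp
#include <cstdint>

// Simplified page attributes held in each TLB / page table entry.
struct PageAttrs {
    bool     private_page;       // '1' => accessed by a single core so far
    bool     read_only;          // '1' => no store has been made to the page
    uint16_t last_requester_id;  // aids private/shared classification
};

// Decide whether a load may issue immediately (illustrative only).
bool load_may_issue(const PageAttrs& a, bool all_prev_loads_complete) {
    if (a.private_page || a.read_only)
        return true;                    // non-conflicting: issue right away
    return all_prev_loads_complete;     // shared read-write: keep Load->Load order
}

// Decide whether a store may issue (illustrative only). A store to a read-only
// page first takes a page-protection fault that clears the attribute.
bool store_may_issue(const PageAttrs& a,
                     bool all_prev_ops_committed,
                     bool all_prev_stores_complete) {
    if (a.private_page)
        return all_prev_ops_committed;  // single-thread correctness only
    return all_prev_stores_complete;    // shared: keep Store->Store order
}
```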
5.6.2
TLB Miss Handler
On a TLB miss, the handler checks if the page is being accessed for the first time.
If yes, the Private bit is set to '1' and the last-requester-ID is set to the ID of the
currently accessing core. The read-only bit is also set to '1' since this is the very first
access to the page. On the other hand, if the page was accessed earlier, the status
of the private bit is checked. If the private bit is set to '1', the last-requester-ID is checked against the ID of the currently accessing core. If it matches, the page table entry is simply loaded into the TLB. Else, the entry in the last-requester-ID's TLB is invalidated. The invalidation request only returns after all outstanding memory operations to the same page made by
the last-requester-ID have completed and committed. This is required to ensure that
the TSO model is not violated because these outstanding requests are now directed
at shared data. In addition, the private bit is set to '0' since this page is now shared
amongst multiple cores. On the other hand, if the private bit was set to '0' to start
with, then the page table entry is simply loaded into the TLB.
5.6.3
Page Protection Fault Handler
A page-protection fault is triggered when a write is made to a page whose read-only
attribute is set to '1'. The fault handler first checks if the page is private. If so, the read-only attribute is simply set to '0'. If the page is shared, then the TLB entries of all the cores in the system are invalidated. Finally, the read-only attribute is set to '0' and
the handler returns.
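A condensed software-style sketch of the two handlers follows. The page-table-entry layout and the TLB shootdown hooks are placeholders invented for illustration; the control flow follows Sections 5.6.2 and 5.6.3.

```cpp
#include <cstdint>

struct PageTableEntry {
    bool     accessed_before   = false;  // false until the very first access
    bool     private_page      = false;
    bool     read_only         = false;
    uint16_t last_requester_id = 0;
};

// Placeholder shootdown hooks: they return only after all outstanding memory
// operations to the page by the targeted core(s) have completed and committed.
static void invalidate_tlb_entry(uint16_t /*core_id*/, uint64_t /*vpage*/) {}
static void invalidate_tlb_all_cores(uint64_t /*vpage*/) {}

void tlb_miss_handler(PageTableEntry& pte, uint16_t requester, uint64_t vpage) {
    if (!pte.accessed_before) {                 // very first access to the page
        pte.accessed_before   = true;
        pte.private_page      = true;
        pte.read_only         = true;
        pte.last_requester_id = requester;
    } else if (pte.private_page && pte.last_requester_id != requester) {
        invalidate_tlb_entry(pte.last_requester_id, vpage);
        pte.private_page = false;               // now shared by multiple cores
    }
    // The (possibly updated) entry is then loaded into the requester's TLB.
}

void page_protection_fault_handler(PageTableEntry& pte, uint64_t vpage) {
    if (!pte.private_page)                      // shared: shoot down all TLB copies
        invalidate_tlb_all_cores(vpage);
    pte.read_only = false;                      // the page becomes read-write
}
```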
5.6.4
Discussion
The advantage of the above scheme is that it requires negligible additional hardware capabilities.
However, operating system support is required.
Moreover, the
performance is highly dependent on the number of private and read-only pages in the
application.
5.6.5
Combining with Timestamp-based Speculation
The TSO ordering of shared read-write data can be implemented in a straightforward
manner following the steps in Section 5.4.1. A higher performing strategy would be to execute the accesses to shared read-write data out of program order and employ the timestamp-based speculative execution scheme discussed in the previous sections for ensuring TSO ordering. Note that using the timestamp check only for shared read-write data implies that the history queue modifications and search operations can be avoided for private & shared read-only data. This reduces the energy overhead
of the history queues. We will evaluate these two approaches in this chapter.
5.7
Evaluation Methodology
We evaluate a 64-core shared memory multicore using out-of-order cores. The default
architectural parameters used for evaluation are shown in Table 2.1. The parameters
specific to the timestamp-based speculation violation detection and the locality-aware
private cache replication scheme are shown in Table 5.1.
Architectural Parameter                                     Value
Out-of-Order Core
    Speculation Violation Detection                         Timestamp-based
    Timestamp Width (TW)                                    16 bits
    History Retention Period (HRP)                          512 ns
    L1 Load/Store History Queue (L1LHQ/L1SHQ) Size          0.8 KB, 0.4 KB
    L2 Load/Store History Queue (L2LHQ/L2SHQ) Size          0.4 KB, 0.2 KB
    Pending Store History Queue (PSHQ) Size                 0.2 KB
Locality-Aware Private Cache Replication
    Private Caching Threshold                               PCT = 4
    Max Remote Access Threshold                             RATmax = 16
    Number of RAT Levels                                    nRATlevels = 2
    Classifier                                              Limited3

Table 5.1: Timestamp-based speculation violation detection & locality-aware private cache replication scheme parameters.
5.7.1
Performance Models
Word & Cache Line Access: The Locality-aware Private Cache Replication protocol requires two separate access widths for reading or writing the shared L2 caches, i.e., a word for remote sharers and a cache line for private sharers. For simplicity, we assume the same L2 cache access latency for both word and cache line accesses.
Load Speculation: The overhead of speculation violation is modeled by fast-forwarding the fetch time of the offending instruction to its commit time (on the previous attempt) when a violation is detected. This models the first-order performance drawback of speculation. However, the performance drop due to the network traffic and
cache accesses incurred by in-flight instructions is not modeled.
5.7.2 Energy Models
Word & Cache Line Access: When calculating the energy consumption of the L2
cache, we assume a word addressable cache architecture.
This allows our protocol
to have a more efficient word access compared to a cache line access. We model the
dynamic energy consumption of both the word access and the cache line access in the
L2 cache.
Load Speculation: The energy overhead due to speculation violations is obtained
using the following analytical model.
Energy-Speculation = Speculation-Stall-Cycles × IPC × Energy-per-Instruction
The first 2 terms together generate the total number of inflight instructions when
the speculation violation was detected. If this is multiplied by the average energy per
instruction, the energy overhead due to speculation can be obtained.
History Queues: The energy consumption for the INSERT and SEARCH operations of each history queue is conservatively assumed to be the amount of energy it takes
for an L1-D cache tag read. Note that the size of the L1-D cache tag array is 2.6 KB.
The tag array contains 512 tags, each 36 bits wide (subtracting out the index and
offset bits from the physical address).
On the other hand, the size of each history
queue is < 0.8 KB.
5.8 Results

5.8.1 Comparison of Schemes
In this section, we perform an exhaustive comparison between the various schemes
introduced in this chapter to implement locality-aware coherence on an out-of-order
processor while maintaining the TSO memory model. The comparison is performed
against the Reactive-NUCA protocol. All implementations of the locality-aware protocol use a PCT value of 4. Section 5.8.2 describes the rationale behind this choice.

Figure 5-6: Completion Time breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here.
1. Reactive-NUCA (RNUCA): This is the baseline scheme that implements the data placement and migration techniques of R-NUCA (basically, the locality-aware protocol with a PCT of 1).
2. Simple TSO Implementation (SER): The simplest implementation of TSO on
the locality-aware protocol that serializes memory accesses naively according to
TSO ordering (c.f. Section 5.4.1).
3. Parallel Non-Conflicting Accesses (NC): This scheme classifies data as shared/private and read-only/read-write at page-granularity and only applies serialization to shared read-write data (c.f. Section 5.6).

Figure 5-7: Energy breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here.
4. Timestamp-based Consistency Validation (TS): Executes loads speculatively
using timestamp-based validation (c.f. Section 5.4) for shared read-write data.
Shared read-only and private data are handled as in the NC scheme.
5. Timestamp-based + Stall Fence (TS-STF): Same as TS but the micro-op dispatch is stalled till a fence completes to ensure Fence → Load ordering. The L1LHQ & L2LHQ are not required for detecting violations here. It has lower hardware overhead than TS but potentially lower performance due to stalling on fence operations.

6. No Speculation Violations (IDEAL): Same as TS but speculation violations are ignored. It provides the upper limit on performance and energy consumption. The L1 & L2 history queues are not required since no speculation failure checks
The completion time and energy consumption of the above schemes are plotted
in Figures 5-6 & 5-7 respectively.
Completion Time: The parallel completion time is broken down into the following
8 categories:
1. Instructions: Number of instructions in the application.
2. L1-I Fetch Stalls: Stall time due to instruction cache misses.
3. Compute Stalls: Stall time due to waiting for functional unit (ALU, FPU,
Multiplier, etc.) results.
4. Memory Stalls: Stall time due to load/store queue capacity limits, fences and
waiting for loads.
5. Load Speculation: Stall time due to memory consistency violations caused by
speculative loads.
6. Branch Speculation: Stall time due to mis-predicted branch instructions.
7. Synchronization: Stall time due to waiting on locks, barriers and condition
variables.
8. Idle: Initial time spent waiting for a thread to be spawned.
Benchmarks with a high private cache miss rate such as BARNES, OCEAN-NC, CONCOMP, and DEDUP do not perform well with the SER scheme. This is because the
cache misses cause the load and store queues to fill up, thereby stalling the pipeline.
This can be understood by observing the fraction of memory stalls in the completion
time of these benchmarks (with the SER scheme).
In addition, these benchmarks
contain a significant degree of synchronization as well. This causes the memory stalls
in one thread to increase the synchronization penalty of threads waiting on it, thereby
creating a massive slowdown.
The performance problems observed by the SER scheme are shared by the NC scheme as well, particularly in the BARNES, OCEAN-NC and CANNEAL benchmarks.
The NC scheme can only efficiently handle accesses to private and shared read-only
data.
Since these benchmarks contain a majority of accesses to shared read-write
data, the NC scheme performs poorly. The synchronization penalties lead to a massive
slowdown for the NC scheme as well.
Benchmarks such as DEDUP, PATRICIA, CONCOMP and MATMUL contain a significant number of non-conflicting memory accesses (i.e., accesses to private and shared
read-only data), and hence, the NC scheme is able to perform better than the SER
scheme. The improved performance is a result of the lower percentage of memory
stalls (and related synchronization waits).
The TS scheme performs well on all benchmarks and matches the performance
of the IDEAL scheme. The TS scheme performs better than Reactive-NUCA on all
benchmarks except LU-C, where it performs worse due to store queue stalls obtained
from serializing remote stores. Overall, the TS scheme only spends a small amount
of time stalling due to load speculation violations. The stall time is only visible in one benchmark, OCEAN-NC, and even here, stalling due to speculation violation just replaces the already occurring stalls due to the limited size of the reorder buffer.
The TS-STF scheme stalls the dispatch stage on a fence till all previous stores have committed. Hence, it performs poorly in benchmarks with a significant number of fences, e.g., CANNEAL and TSP, while it performs well when the number of fences is negligible, e.g., FACESIM and DEDUP. Note that all fences seen in the evaluated benchmarks are implicit fences introduced by atomic operations in x86, e.g., test-and-set, compare-and-swap, etc. There were almost no explicit MFENCE instructions observed.
Overall, the TS scheme improves performance by 16% compared to RNUCA, and
the TS-STF scheme improves performance by 10%. The SER and NC schemes reduce
performance by 14% and 11% respectively.
Energy: All the locality-aware coherence protocol implementations (i.e., all except
RNUCA) are found to significantly reduce L2 cache and network energy due to the
following 3 factors:
1. Fetching an entire line on a cache miss is replaced by multiple cheaper word
accesses to the shared L2 cache.
2. Reducing the number of private sharers decreases the number of invalidations
(and acknowledgments) required to keep all cached copies of a line coherent.
Synchronous write-back requests that are needed to fetch the most recent copy
of a line are reduced as well. (Note: increasing the value of PCT reduces the
number of private sharers and increases the number of remote sharers).
3. Since the caching of low-locality data is eliminated, the L1 cache space is more effectively used for high locality data, thereby decreasing the amount of asynchronous evictions (that lead to capacity misses) for such data.
Among the locality-aware coherence protocol implementations, SER, NC, and
IDEAL exhibit the best dynamic energy consumption.
Dynamic energy consump-
tion increases when TS-STF is used and increases even further when TS is used.
This is because both these implementations modify and access the L1 & L2 cache
history queues. The TS-STF scheme only requires the store history queues since it
stalls on a fence while the TS scheme requires both load & store history queues to
perform consistency checks, thereby creating a larger energy overhead.
Note that
page-classification is used in both the TS & TS-STF schemes to ensure that history
queue modification & access is only done for shared read-write data since accesses to
private & shared read-only data cannot cause consistency violations. Overall, the TS,
TS-STF, SER & NC schemes reduce energy by 21%, 23%, 25% and 25% respectively
over the RNUCA baseline.
5.8.2
Sensitivity to PCT
In this section, we study the impact of the Private Caching Threshold (PCT) parameter on the overall system performance. PCT controls the percentage of remote
and private cache accesses in the system.
Figure 5-8: Completion Time and Energy consumption as PCT varies from 1 to 16. Results are normalized to a PCT of 1 (i.e., Reactive-NUCA protocol).

A higher PCT increases the percentage
of remote accesses while a lower PCT increases the percentage of cache line fetches
into the private cache. Finding the optimal value of PCT is paramount to system
performance. We plot the geometric means of the Completion Time and Energy for
our benchmarks as a function of PCT in Figure 5-8. We observe a gradual decrease
in completion time till a PCT of 4, constant completion time till a PCT of 8 and then
a gradual increase afterward. Energy consumption reduces steadily till a PCT of 4,
reaches a global minimum at 4 and then increases steadily afterward.
Varying PCT impacts energy consumption by changing both network traffic and
cache accesses. Fetching an entire line on a cache miss requires moving 10 flits over
the network. On the other hand, fetching a word only costs 3 flits while writing a
word to a remote cache costs 5 flits (due to two-phase stores). As PCT increases,
cache line fetches are traded-off with increasing numbers of remote-word accesses.
This causes the network traffic to first reduce and then increase as PCT increases,
thus explaining the trends in energy consumption.
The completion time shows an initial gradual reduction due to lower network
contention.
The gradual increase afterward is due to increased network traffic at
higher values of PCT. Overall, the locality-aware protocol obtains a 16% completion
time reduction and a 21% energy reduction when compared to the Reactive-NUCA
baseline.
5.8.3 Sensitivity to History Retention Period (HRP)
Figure 5-9: Completion Time sensitivity to History Retention Period (HRP) as HRP
varies from 64 to 4096.
In this section, the impact of the History Retention Period (HRP) on system performance is studied. Figure 5-9 plots the completion time as a function of HRP. A
small value of HRP reduces the size requirement of the load/store history queues
at the Li and L2 caches (L1LHQ, L1SHQ, L2LHQ & L2SHQ) as described in Section 6.2.3. A small HRP also reduces network traffic since timestamps are less likely
to be found in the history queues, and thus less likely to be communicated in a
network message.
However, a small HRP also discards history information faster,
requiring the mechanism to make a conservative assumption regarding the time the
last loads/stores were made (cf.
Section 5.4.3). This increases the chances of the
speculation check failing, thereby increasing completion time.
From Figure 5-9, we observe that an HRP of 64 performs considerably worse when
compared to the other data points. The high completion time is due to the instruction
replays incurred due to speculation violation. HRP values of 128 and 256 reduce the
speculation violations, and thereby improve performance considerably. However, we
opted to go for an HRP of 512 since its performance is within ~ 1% of an HRP of
4096.
5.9
Summary
This chapter studied the efficiency and programmability tradeoffs with a state-of-the-art data access mechanism called remote access. Complex cores and strong memory models impose memory ordering restrictions that have been efficiently managed in traditional coherence protocols. Remote access introduces serialization penalties, which hamper the memory-level parallelism (MLP) in an application.
A timestamp-based speculation scheme is proposed that enables remote accesses to
be issued and completed in parallel while continuously detecting whether any ordering
violations have occurred and rolling back the pipeline state (if needed). The scheme is
implemented for a state-of-the-art locality-aware cache coherence protocol that uses
remote access as an auxiliary mechanism for efficient data access.
The evaluation
using a 64-core multicore with out-of-order speculative cores shows that our proposed
technique improves completion time by 16% and energy by 21% over a state-of-the-art
cache management scheme while requiring only 2.5 KB storage overhead per-core.
Chapter 6
Locality-aware LLC Replication Scheme
This thesis proposes a data replication mechanism for the LLC that retains the on-chip cache utilization of the shared LLC while intelligently replicating cache lines
close to the requesting cores so as to maximize data locality. To achieve this goal,
a low-overhead yet highly accurate in-hardware locality classifier is proposed that
operates at the cache line granularity and only allows the replication of cache lines
with high reuse. This classifier captures the LLC pressure and adapts its replication
decision accordingly.
This chapter is organized as follows. Section
6.1 motivates data replication in the
LLC. Section 6.2 describes a detailed implementation of locality-aware replication.
Section 6.3 discusses the rationale behind key design decisions. Section 6.4 designs a
mechanism that performs replication at the cluster-level and discusses the efficacy of
this approach. Section 6.5 describes the evaluation methodology. Section 6.6 presents
the results. And finally, Section 6.7 provides a brief summary of this chapter.
Figure 6-1: Distribution of instructions, private data, shared read-only data, and shared read-write data accesses to the LLC as a function of run-length (grouped into run-lengths of 1-2, 3-9, and 10 or more). The classification is done at the cache line granularity.
6.1 Motivation

6.1.1 Cache Line Reuse
The utility of data replication at the LLC can be understood by measuring cache line
reuse. Figure 6-1 plots the distribution of the number of accesses to cache lines in the
LLC as a function of run-length. Run-length is defined as the number of accesses to a
cache line (at the LLC) from a particular core before a conflicting access by another
core or before it is evicted. Cache line accesses from multiple cores are conflicting if at
least one of them is a write. The L2 cache accesses are broken down into the following
four categories: (1) Instruction, (2) Private Data, (3) Shared Read-Only (RO) Data, and (4) Shared Read-Write (RW) Data. For example, in BARNES, over 90% of the accesses to the LLC occur to shared (read-write) data that has a run-length of 10 or more. The greater the number of accesses with higher run-length, the greater the benefit of replicating the cache line in the requester's LLC slice. Hence, BARNES would benefit from replicating shared (read-write) data. Similarly, FACESIM would benefit from replicating instructions and PATRICIA would benefit from replicating shared (read-only) data. On the other hand, FLUIDANIMATE and OCEAN-C would not benefit since
most cache lines experience just 1 or 2 accesses to them before a conflicting access
or an eviction. For such benchmarks, replication in the LLC would increase pollution, memory access latency and invalidation penalty without improving data locality.
Hence, the replication decision should not depend on the type of data, but rather
on its locality. Instructions and shared-data (both read-only and read-write) can be
replicated if they demonstrate good reuse. It is also important to adapt the replication
decision at runtime in case the reuse of data changes during an application's execution.
6.1.2
Cluster-level Replication
Another design option to implement replication at the LLC (/L2 cache) is to perform
replication at the cluster-level instead of creating a replica per-core, i.e., group a set
of LLC slices into a cluster and create at most one replica per cluster. In order to
gain more insight into this approach, the accesses to the L2 cache are studied as a
function of the sharing degree. The sharing degree of a cache line is defined as the
number of sharers when it is accessed. Figures 6-2 and 6-3 plot this information for
four applications.
The L2 cache accesses are broken down into the following four
categories: (1) Instruction, (2) Shared Read-Only (RO) Data, (3) Shared Read-Write
(RW) Data Read and (4) Shared Read-Write (RW) Data Write. Private Data Read
and Private Data Write are special cases of Shared Read-Write Data read and write
with number of sharers equal to 1.
The majority of L2 cache accesses in BARNES and BODYTRACK are read accesses
to widely shared data. In BARNES, all the accesses are to shared read-write data (this
type of data is frequently read and sparsely written). However, in BODYTRACK, the
accesses are equally divided up between instructions, shared read-only and shared
read-write data. In RAYTRACE and VOLREND, most L2 cache accesses are reads to
shared read-only data. However, the sharing degree of the cache lines that are read is
variable. The sharing degree of cache lines is an important parameter when deciding
the cluster size for replication. While cache lines with a high sharing degree can be
shared by all neighboring cores, cache lines with a limited sharing degree can only be
Figure 6-2: Distribution of the accesses to the shared L2 cache as a function of the sharing degree for (a) BARNES and (b) BODYTRACK. The accesses are broken down into (1) Instruction, (2) Shared Read-Only (RO) Data, (3) Shared Read-Write (RW) Data Read and (4) Shared Read-Write (RW) Data Write.
Figure 6-3: Distribution of the accesses to the shared L2 cache as a function of the sharing degree for (a) RAYTRACE and (b) VOLREND. The accesses are broken down into (1) Instruction, (2) Shared Read-Only (RO) Data, (3) Shared Read-Write (RW) Data Read and (4) Shared Read-Write (RW) Data Write.
shared by a restricted number of cores. Increasing the number of replicas improves the
data locality (i.e., L2 hit latency) but also increases the off-chip miss rate. Decreasing
the number of replicas has the opposite effect. This thesis explores a mechanism to
implement cluster-level replication in Section 6.4.
6.1.3
Proposed Idea
We propose a low-overhead yet highly accurate hardware-only predictive mechanism
to track and classify the reuse of each cache line in the LLC. Our runtime classifier
only allows replicating those cache lines that demonstrate reuse at the LLC while
bypassing replication for others. When a cache line replica is evicted or invalidated,
our classifier adapts by adjusting its future replication decision accordingly.
This
reuse tracking mechanism is decoupled from the sharer tracking structures that cause
scalability concerns in traditional cache coherence protocols.
The locality-aware protocol is advantageous because it:
1. Enables lower memory access latency and energy by selectively replicating cache
lines that show high reuse in the LLC slice of the requesting core.
2. Better exploits the LLC by balancing the off-chip miss rate and on-chip locality
using a classifier that adapts to the runtime reuse at the granularity of cache
lines.
3. Allows coherence complexity almost identical to that of a traditional non-hierarchical (flat) coherence protocol since replicas are only allowed to be placed at the LLC slice of the requesting core. The additional coherence complexity only arises within a core when the LLC slice is searched on an L1 cache miss, or when a cache line in the core's local cache hierarchy is evicted/invalidated.
6.2
Locality-Aware LLC Data Replication
We describe how the locality-aware protocol works by implementing it on top of the
baseline described in Chapter 2.
Figure 6-4: Mockup requests showing the locality-aware LLC replication protocol. The black data block has high reuse, so a local LLC replica is allowed and services the requests to it. The low-reuse red data block is not allowed to be replicated at the LLC, and a request for it that misses in the L1 must access the LLC slice at its home core. The home core for each data block can also service local private cache misses.
6.2.1
Protocol Operation
The four essential components of data replication are: (1) choosing which cache lines
to replicate, (2) determining where to place a replica, (3) how to lookup a replica, and
(4) how to maintain coherence for replicas. We first define a few terms to facilitate
describing our protocol.
1. Home Location: The core where all requests for a cache line are serialized for
maintaining coherence.
2. Replica Sharer: A core that is granted a replica of a cache line in its LLC
slice.
3. Non-Replica Sharer: A core that is NOT granted a replica of a cache line in
its LLC slice.
4. Replica Reuse: The number of times an LLC replica is accessed before it is
invalidated or evicted.
5. Home Reuse: The number of times a cache line is accessed at the LLC slice in its home location before a conflicting write or eviction.

6. Replication Threshold (RT): The reuse above or equal to which a replica is created.

Figure 6-5: Each directory entry is extended with replication mode bits to classify the usefulness of LLC replication. Each cache line is initialized to non-replica mode with respect to all cores. Based on the reuse counters (at the home as well as the replica location) and the parameter RT, the cores are transitioned between replica and non-replica modes. Here XReuse is (Replica + Home) Reuse on an invalidation and Replica Reuse on an eviction.
Note that for a cache line, one core can be a replica sharer while another can be
a non-replica sharer.
Our protocol starts out as a conventional directory protocol
and initializes all cores as non-replica sharers of all cache lines (as shown by Initial
in Figure 6-5).
Let us understand the handling of read requests, write requests,
evictions, invalidations and downgrades as well as cache replacement policies under
this protocol.
Read Requests
On an L1 cache read miss, the core first looks up its local LLC slice for a replica. If a replica is found, the cache line is inserted at the private L1 cache. In addition,
a Replica Reuse counter (as shown in Figure 6-6) at the LLC directory entry is
incremented. The replica reuse counter is a saturating counter used to capture reuse
information. It is initialized to '1' on replica creation and incremented on every replica
hit.
Figure 6-6: ACKwisep-Complete locality classifier LLC tag entry. It contains the tag, LRU bits and directory entry. The directory entry contains the state, ACKwisep pointers, a Replica reuse counter as well as Replication mode bits and Home reuse counters for every core in the system.

On the other hand, if a replica is not found, the request is forwarded to the LLC home location. If the cache line is not found there, it is either brought in from the off-chip memory or the underlying coherence protocol takes the necessary actions to
obtain the most recent copy of the cache line. The directory entry is augmented with
additional bits as shown in Figure 6-6. These bits include (a) Replication Mode bit
and (b) Home Reuse saturating counter for each core in the system. Note that adding
several bits for tracking the locality of each core in the system does not scale with the
number of cores, therefore, we will present a cost-efficient classifier implementation
in Section 6.2.2.
The replication mode bit is used to identify whether a replica is
allowed to be created for the particular core. The home reuse counter is used to track
the number of times the cache line is accessed at the home location by the particular
core. This counter is initialized to '0' and incremented on every hit at the LLC home
location.
If the replication mode bit is set to true, the cache line is inserted in the requester's LLC slice and the private L1 cache. Otherwise, the home reuse counter is incremented. If this counter has reached the Replication Threshold (RT), the requesting core is "promoted" (the replication mode bit is set to true) and the cache line is inserted in its LLC slice and private L1 cache. If the home reuse counter is still less than RT, a replica is not created. The cache line is only inserted in the requester's private L1 cache.
If the LLC home location is at the requesting core, the read request is handled directly at the LLC home. Even if the classifier directs to create a replica, the cache line is just inserted at the private L1 cache.
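An illustrative sketch of the home-side decision for a read request is shown below. The structure and method names are assumptions; the replication threshold of 3 follows the optimal RT stated in Section 6.2.3.

```cpp
#include <cstdint>

constexpr int kReplicationThreshold = 3;  // RT (the text assumes an optimal RT of 3)

// Per-core locality state kept in the directory entry (Complete classifier).
struct CoreLocality {
    bool    replica_mode = false;  // may this core keep an LLC replica?
    uint8_t home_reuse   = 0;      // 2-bit saturating counter in hardware
};

enum class ReadGrant { ReplicaLine, L1OnlyLine };

// Called at the LLC home when a requesting core misses in its local LLC slice.
ReadGrant handle_home_read(CoreLocality& loc) {
    if (loc.replica_mode)
        return ReadGrant::ReplicaLine;       // insert in requester's LLC slice + L1

    if (++loc.home_reuse >= kReplicationThreshold) {
        loc.replica_mode = true;             // "promote" the requester
        return ReadGrant::ReplicaLine;
    }
    return ReadGrant::L1OnlyLine;            // only the private L1 gets the line
}
```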
Write Requests
On an L1 cache write miss for an exclusive copy of a cache line, the protocol checks the local LLC slice for a replica. If a replica exists in the Modified (M) or Exclusive (E) state, the cache line is inserted at the private L1 cache. In addition, the Replica Reuse counter is incremented.
If a replica is not found or exists in the Shared (S) state, the request is forwarded to the LLC home location. The directory invalidates all the LLC replicas and L1 cache copies of the cache line, thereby maintaining the single-writer multiple-reader invariant [86]. The acknowledgements received are processed as described in Section 6.2.1. After all such acknowledgements are processed, the Home Reuse counters of all non-replica sharers other than the writer are reset to '0'. This has to be done since these sharers have not shown enough reuse to be "promoted".
If the writer is a non-replica sharer, its home reuse counter is modified as follows.
If the writer is the only sharer (replica or non-replica), its home reuse counter is
incremented, else it is reset to '1'. This enables the replication of migratory shared
data at the writer, while avoiding it if the replica is likely to be downgraded due to
conflicting requests by other cores.
Evictions and Invalidations
On an invalidation request, both the LLC slice and L1 cache on a core are probed and invalidated. If a valid cache line is found in either cache, an acknowledgement is sent to the LLC home location. In addition, if a valid LLC replica exists, the replica reuse counter is communicated back with the acknowledgement. The locality classifier uses this information along with the home reuse counter to determine whether the core stays as a replica sharer. If the (replica + home) reuse is >= RT, the core maintains replica status, else it is demoted to non-replica status (as shown in Figure 6-5). The two reuse counters have to be added since this is the total reuse that the core exhibited for the cache line between successive writes.
When an L1 cache line is evicted, the LLC replica location is probed for the same address. If a replica is found, the dirty data in the L1 cache line is merged with it, else an acknowledgement is sent to the LLC home location. However, when an LLC replica is evicted, the L1 cache is probed for the same address and invalidated. An acknowledgement message containing the replica reuse counter is sent back to the LLC home location. The replica reuse counter is used by the locality classifier as follows. If the replica reuse is >= RT, the core maintains replica status, else it is demoted to non-replica status. Only the replica reuse counter has to be used for this decision since it captures the reuse of the cache line at the LLC replica location.
After the acknowledgement corresponding to an eviction or invalidation of the
LLC replica is received at the home, the locality classifier sets the home reuse counter
of the corresponding core to '0' for the next round of classification.
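The classifier transition on an invalidation or eviction acknowledgement can be summarized by the sketch below (XReuse as defined in Figure 6-5; structure and function names are illustrative assumptions):

```cpp
#include <cstdint>

constexpr int kReplicationThreshold = 3;  // RT

struct CoreLocality {
    bool    replica_mode;
    uint8_t home_reuse;
};

// Invoked at the LLC home when the acknowledgement for an invalidation or an
// eviction of a core's LLC replica arrives, carrying the replica reuse count.
void on_replica_ack(CoreLocality& loc, uint8_t replica_reuse, bool was_invalidation) {
    // XReuse = Replica + Home reuse on an invalidation, Replica reuse on an eviction.
    unsigned xreuse = was_invalidation ? replica_reuse + loc.home_reuse
                                       : replica_reuse;
    loc.replica_mode = (xreuse >= kReplicationThreshold);  // keep replica status or demote
    loc.home_reuse = 0;              // reset for the next round of classification
}
```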
The eviction of an LLC replica back-invalidates the L1 cache (as described earlier). A possibly more optimal strategy is to maintain the validity of the L1 cache line. This requires two additional message types, one to communicate back the reuse counter on the LLC replica eviction and another to communicate the acknowledgement when the L1 cache line is finally invalidated or evicted. We opted for the back-invalidation for the following two reasons:

1. To maintain the simplicity of the coherence protocol

2. Since the energy and performance improvements of the more optimal strategy are negligible. This is because: (a) the LLC size is > 4x the L1 cache size, thereby keeping the probability of evicted LLC lines having an L1 copy extremely low, and (b) the LLC replacement policy implemented prioritizes retaining cache lines that have L1 cache copies.
LLC Replacement Policy
Traditional LLC replacement policies use the least recently used (LRU) policy. One
reason why this is sub-optimal is that the LRU information cannot be fully captured
at the LLC because the L1 cache filters out a large fraction of accesses that hit within it. In order to be cognizant of this, the replacement policy should prioritize retaining cache lines that have L1 cache sharers. Some proposals in literature accomplish this by sending periodic Temporal Locality Hint messages from the L1 cache to the LLC [44]. However, this incurs additional network traffic.

Figure 6-7: ACKwisep-Limitedk locality classifier LLC tag entry. It contains the tag, LRU bits and directory entry. The directory entry contains the state, ACKwisep pointers, a Replica reuse counter as well as the Limitedk classifier. The Limitedk classifier contains a Replication mode bit and Home reuse counter for a limited number of cores. A majority vote of the modes of tracked cores is used to classify new cores as replicas or non-replicas.
Our replacement policy accomplishes the same using a much simpler scheme. It
first selects cache lines with the least number of L1 cache copies and then chooses the
least recently used among them. The number of L1 cache copies is readily available
since the directory is integrated within the LLC tags ("in-cache" directory).
This
reduces back invalidations to a negligible amount and outperforms the LRU policy
(cf. Section 6.6.2).
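A sketch of the victim selection follows; the per-way metadata layout is a simplifying assumption, but the two-step rule (fewest L1 copies first, then LRU) is the policy described above.

```cpp
#include <cstdint>
#include <vector>

struct WayInfo {
    uint8_t  l1_copies;  // number of L1 sharers, read from the in-cache directory
    uint64_t lru_age;    // larger value => less recently used
};

// Pick the victim way in an LLC set: prefer ways with the fewest L1 copies,
// and break ties by evicting the least recently used among them.
int select_victim(const std::vector<WayInfo>& set) {
    int victim = 0;
    for (int w = 1; w < static_cast<int>(set.size()); ++w) {
        const WayInfo& a = set[w];
        const WayInfo& b = set[victim];
        if (a.l1_copies < b.l1_copies ||
            (a.l1_copies == b.l1_copies && a.lru_age > b.lru_age))
            victim = w;
    }
    return victim;
}
```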
6.2.2
Limited Locality Classifier Optimization
The classifier described earlier which keeps track of locality information for all the
cores in the directory entry is termed the Complete locality classifier. It has a storage
overhead of 30% (calculated in Section 6.2.3) at 64 cores and over 5x at 1024 cores.
In order to mitigate this overhead, we develop a classifier that maintains locality
information for a limited number of cores and classifies the other cores as replica or
non-replica sharers based on this information.
The locality information for each core consists of (1) the core ID, (2) the replication
mode bit and (3) the home reuse counter.
The classifier that maintains a list of
this information for a limited number of cores (k) is termed the Limitedk classifier.
Figure 6-7 shows the information that is tracked by this classifier. The sharer list of
the ACKwise limited directory entry cannot be reused for tracking locality information
because of its different functionality. While the hardware pointers of ACKwise are
used to maintain coherence, the limited locality list serves to classify cores as replica
or non-replica sharers.
Decoupling in this manner also enables the locality-aware
protocol to be implemented efficiently on top of other scalable directory organizations.
We now describe the working of the limited locality classifier.
At startup, all entries in the limited locality list are free and this is denoted by marking all core IDs as Invalid. When a core makes a request to the home location,
the directory first checks if the core is already being tracked by the limited locality
list. If so, the actions described previously are carried out. Else, the directory checks
if a free entry exists. If it does exist, it allocates the entry to the core and the same
actions are carried out.
Otherwise, the directory checks if a currently tracked core can be replaced. An
ideal candidate for replacement is a core that is currently not using the cache line.
Such a core is termed an inactive sharer and should ideally relinquish its entry to
a core in need of it. A replica core becomes inactive on an LLC invalidation or an
eviction. A non-replica core becomes inactive on a write by another core. If such a
replacement candidate exists, its entry is allocated to the requesting core. The initial
replication mode of the core is obtained by taking a majority vote of the modes of
the tracked cores. This is done so as to start off the requester in its most probable
mode.
Finally, if no replacement candidate exists, the mode for the requesting core is
obtained by taking a majority vote of the modes of all the tracked cores. The limited
locality list is left unchanged.
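A behavioral sketch of the Limited_k lookup, allocation and majority vote is given below. The data layout is simplified and the names are assumptions; the allocation order (tracked entry, free entry, inactive sharer, otherwise majority vote) follows the description above.

```cpp
#include <array>
#include <cstdint>

constexpr int kTracked = 3;                  // Limited_3 classifier
constexpr uint16_t kInvalidCore = 0xFFFF;    // marks a free entry

struct LocalityEntry {
    uint16_t core_id      = kInvalidCore;
    bool     replica_mode = false;
    uint8_t  home_reuse   = 0;
    bool     active       = false;           // cleared when the core stops using the line
};

using LimitedList = std::array<LocalityEntry, kTracked>;

// Majority vote over the replication modes of the currently tracked cores.
bool majority_mode(const LimitedList& list) {
    int tracked = 0, replicas = 0;
    for (const auto& e : list)
        if (e.core_id != kInvalidCore) { ++tracked; replicas += e.replica_mode ? 1 : 0; }
    return tracked > 0 && 2 * replicas > tracked;
}

// Returns the entry tracking 'core', allocating a free or inactive entry when
// possible; returns nullptr when the core must be classified by majority vote.
LocalityEntry* lookup_or_allocate(LimitedList& list, uint16_t core) {
    for (auto& e : list)                     // (1) already tracked
        if (e.core_id == core) return &e;
    for (auto& e : list)                     // (2) free entry: start as non-replica
        if (e.core_id == kInvalidCore) { e = {core, false, 0, true}; return &e; }
    bool vote = majority_mode(list);         // (3) replace an inactive sharer
    for (auto& e : list)
        if (!e.active) { e = {core, vote, 0, true}; return &e; }
    return nullptr;                          // (4) untracked: caller uses majority_mode()
}
```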
The storage overhead for the Limitedk classifier is directly proportional to the
number of cores (k) for which locality information is tracked.
In Section 6.6.3, we
evaluate the storage and accuracy tradeoffs for the Limitedk classifier. Based on our
observations, we pick the Limited3 classifier.
6.2.3
Overheads
Storage
The locality-aware protocol requires extra bits at the LLC tag arrays to track locality
information. Each LLC directory entry requires 2 bits for the replica reuse counter
(assuming an optimal RT of 3). The Limited3 classifier tracks the locality information
for three cores. Tracking one core requires 2 bits for the home reuse counter, 1 bit to
store the replication mode and 6 bits to store the core ID (for a 64-core processor).
Hence, the Limited3 classifier requires an additional 27 (= 3 x 9) bits of storage per
LLC directory entry. The Complete classifier, on the other hand, requires 192 (= 64
x 3) bits of storage.
All the following calculations are for one core but they are applicable for the
entire processor since all the cores are identical.
The sizes of the per-core L1 and LLC caches used in our system are shown in Table 2.1. The storage overhead of the replica reuse bit is (2 / (64 × 8)) × 256 KB = 1 KB. The storage overhead of the Limited3 classifier is (27 / (64 × 8)) × 256 KB = 13.5 KB. For the Complete classifier, it is (192 / (64 × 8)) × 256 KB = 96 KB. Now, the storage overhead of the ACKwise4 protocol in this processor is 12 KB (assuming 6 bits per ACKwise pointer) and that for a Full Map protocol is 32 KB. Adding up all the storage components, the Limited3 classifier with ACKwise4 protocol uses slightly less storage than the Full Map protocol and 4.5% more storage than the baseline ACKwise4 protocol. The Complete classifier with the ACKwise4 protocol uses 30% more storage than the baseline ACKwise4 protocol.
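The per-slice arithmetic can be restated in terms of directory entries. This assumes a 256 KB LLC slice with 64-byte lines (i.e., 4096 directory entries per slice), consistent with the figures above and Table 2.1:

\begin{align*}
\text{Replica reuse counters} &: 4096 \times 2\ \text{bits} = 1\ \text{KB},\\
\text{Limited}_3\ \text{classifier} &: 4096 \times 27\ \text{bits} = 13.5\ \text{KB},\\
\text{Complete classifier} &: 4096 \times 192\ \text{bits} = 96\ \text{KB},\\
\text{ACKwise}_4\ \text{pointers} &: 4096 \times 4 \times 6\ \text{bits} = 12\ \text{KB},\qquad
\text{Full Map}: 4096 \times 64\ \text{bits} = 32\ \text{KB}.
\end{align*}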
LLC Tag & Directory Accesses
Updating the replica reuse counter in the local LLC slice requires a read-modify-write
operation on each replica hit. However, since the replica reuse counter (being 2 bits)
is stored in the LLC tag array that needs to be written on each LLC lookup to update
the LRU counters, our protocol does not add any additional tag accesses.
At the home location, the lookup/update of the locality information is performed
concurrently with the lookup/update of the sharer list for a cache line. However, the
lookup/update of the directory is now more expensive since it includes both sharer
list and the locality information.
This additional expense is accounted for in our
evaluation.
Network Traffic
The locality-aware protocol communicates the replica reuse counter to the LLC home
along with the acknowledgment for an invalidation or an eviction.
This is accom-
plished without creating additional network flits. For a 48-bit physical address and
64-bit flit size, an invalidation message requires 42 bits for the physical cache line
address, 12 bits for the sender and receiver core IDs and 2 bits for the replica reuse
counter. The remaining 8 bits suffice for storing the message type.
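For concreteness, the single-flit acknowledgement layout described above could be written as the following bitfield sketch. The field widths come from the text; the packing order and names are assumptions.

```cpp
#include <cstdint>

// One 64-bit network flit carrying an invalidation/eviction acknowledgement
// together with the piggybacked replica reuse counter.
struct InvalidationAckFlit {
    uint64_t line_address  : 42;  // physical cache line address (48-bit PA, 64 B lines)
    uint64_t sender_id     : 6;   // source core (64 cores)
    uint64_t receiver_id   : 6;   // destination core
    uint64_t replica_reuse : 2;   // reuse counter piggybacked on the ack
    uint64_t msg_type      : 8;   // message type encoding
};
static_assert(sizeof(InvalidationAckFlit) == 8, "must fit in one 64-bit flit");
```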
6.3 Discussion

6.3.1 Replica Creation Strategy
In the protocol described earlier, replicas are created in all valid cache states. A simpler strategy is to create an LLC replica only in the Shared cache state. This enables instructions, shared read-only and shared read-write data that exhibit high
read run-length to be replicated so as to serve multiple read requests from within
the local LLC slice. However, migratory shared data cannot be replicated with this
simpler strategy because both read and write requests are made to it in an interleaved
manner. Such data patterns can be efficiently handled only if the replica is created
in the Exclusive or Modified state. Benchmarks that exhibit both the above access
patterns are observed in our evaluation (cf. Section 6.6.1).
6.3.2
Coherence Complexity
The local LLC slice is always looked up on an L1 cache miss or eviction. Additionally, both the L1 cache and LLC slice are probed on every asynchronous coherence request (i.e., invalidate, downgrade, flush or write-back). This is needed because the directory
only has a single pointer to track the local cache hierarchy of each core. This method
also allows the coherence complexity to be similar to that of a non-hierarchical (flat)
coherence protocol.
To avoid the latency and energy overhead of searching the LLC replica, one
may want to optimize the handling of asynchronous requests, or decide intelligently
whether to lookup the local LLC slice on a cache miss or eviction. In order to enable
such optimizations, additional sharer tracking bits are needed at the directory and
L1 cache. Moreover, additional network message types are needed to relay coherence
information between the LLC home and other actors.
In order to evaluate whether this additional coherence complexity is worthwhile,
we compared our protocol to a dynamic oracle that has perfect information about
whether a cache line is present in the local LLC slice. The dynamic oracle avoids all
unnecessary LLC lookups. The completion time and energy difference when compared
to the dynamic oracle was less than 1%.
Hence, in the interest of avoiding the
additional complexity, the LLC replica is always looked up for the above coherence
requests.
6.3.3
Classifier Organization
The classifier for the locality-aware protocol is organized using an in-cache structure,
i.e., the replication mode bits and home reuse counters are maintained for all cache
lines in the LLC. However, this is not an essential requirement.
The classifier
is logically decoupled from the directory and could be implemented using a sparse
organization.
The storage overhead for the in-cache organization is calculated in Section 6.2.3.
The performance and energy overhead for this organization is small because: (1) The
classifier lookup incurs a relatively small energy and latency penalty when compared
to the data array lookup of the LLC slice and communication over the network (justified in our results). (2) Only a single tag lookup is needed for accessing the classifier
and LLC data. In a sparse organization, a separate lookup is required for the classifier
and the LLC data. Even though these lookups could be performed in parallel with no latency overhead, the energy expended to look up two CAM structures needs to be paid.
6.4 Cluster-Level Replication
In the locality-aware protocol, the location where a replica is placed is always the LLC
slice of the requesting core. An additional method by which one could explore the
trade-off between LLC hit latency and LLC miss rate is by replicating at a cluster level. A cluster is defined as a group of neighboring cores where there is at most one replica for a cache line. Each such replica would service the misses of all the L1
caches in the same cluster. Increasing the size of a cluster would increase LLC hit
latency and decrease LLC miss rate, and decreasing the cluster size would have the
opposite effect. The optimal replication algorithm would optimize the cluster size so
as to maximize the performance and energy benefit.
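To make the cluster notion concrete, the sketch below shows one possible mapping from a requesting core to the single LLC slice that would hold its cluster's replica on a 2-D mesh. The row-major core layout and the "top-left core of the cluster" choice are assumptions made purely for illustration.

```c
#include <stdio.h>

#define MESH_DIM 8   /* 8x8 mesh = 64 cores (assumption) */

/* Return the core whose LLC slice holds the cluster replica for 'core',
 * given a square cluster of cluster_dim x cluster_dim cores.  Here the
 * replica slice is pinned to the top-left core of the cluster. */
static int cluster_replica_slice(int core, int cluster_dim)
{
    int x = core % MESH_DIM;
    int y = core / MESH_DIM;
    int cx = (x / cluster_dim) * cluster_dim;
    int cy = (y / cluster_dim) * cluster_dim;
    return cy * MESH_DIM + cx;
}

int main(void)
{
    /* Cluster size 4 (2x2): cores 0, 1, 8 and 9 all share the replica in slice 0. */
    printf("%d %d %d %d\n",
           cluster_replica_slice(0, 2), cluster_replica_slice(1, 2),
           cluster_replica_slice(8, 2), cluster_replica_slice(9, 2));
    return 0;
}
```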
We explored the benefits of clustering under our protocol after making the appropriate changes. The changes include the following:
1. Blocking at the replica location (the core in the cluster where a replica could be found) before forwarding the request to the home location, so that multiple cores in the same cluster do not have outstanding requests to the LLC home location.

2. Additional coherence messages for requests and replies between the L1 cache & LLC replica location, and between the LLC replica & LLC home.

3. Additional storage at the directory for differentiating between LLC replicas and L1 caches when tracking sharers.

4. Additional storage at the L1 cache tag to determine whether an L1 cache copy is backed by the LLC replica or the LLC home.

5. Hierarchical invalidation and downgrade of the replica and the L1 caches that it tracks.

6. An additional coherence message to dequeue the request at the LLC replica location in case the LLC home decides not to replicate (in the LLC) but responds directly to the L1 cache.
In addition to these changes, two significant observations were made related to
the ACKwise limited directory protocol.
1. An imprecise tracking of sharers (e.g., ACKwise) can only be done at the LLC home and not at the LLC replica location. At the LLC replica location, precise tracking is needed (e.g., using a full-map directory protocol). This is done to ensure protocol correctness without additional on-chip network support to stop invalidations from crossing cluster boundaries.

2. While performing broadcast invalidations from the LLC home, only the sharers (i.e., L1 caches and LLC replicas that are tracked from the LLC home) should respond with an acknowledgement to the LLC home. L1 caches that are backed by an LLC replica should wait for a matching invalidation before responding. This ensures that invalidation requests take the same path as replies, from the LLC home → LLC replica → L1 cache. Else, an invalidation could arrive early, i.e., before the reply, leading to a protocol deadlock.
Overall, cluster-level replication was not found to be beneficial in the evaluated
64-core system, for the following reasons (see Section 6.6.4 for details).
1. Using clustering increases network serialization delays since multiple locations now need to be searched/invalidated on an L1 cache miss.
2. Cache lines with low degree of sharing do not benefit because clustering just
increases the LLC hit latency without reducing the LLC miss rate.
3. The added coherence complexity of clustering increased our design and verification time significantly.
6.5 Evaluation Methodology
We evaluate a 64-core multicore using in-order cores. The default architectural parameters used for evaluation are shown in Table 2.1. The parameters specific to the locality-aware LLC (/L2) replication scheme are shown in Table 6.1.
Architectural Parameter      Value
Replication Threshold        RT = 3
Classifier                   Limited3

Table 6.1: Locality-aware LLC (/L2) replication parameters
6.5.1 Baseline LLC Management Schemes
We model four baseline multicore systems that assume private L1 caches managed using the ACKwise4 protocol.
1. The Static-NUCA baseline address interleaves all cache lines among the LLC
slices.
2. The Reactive-NUCA [39] baseline places private data at the requester's LLC
slice, replicates instructions in one LLC slice per cluster of 4 cores using rotational interleaving, and address interleaves shared data in a single LLC slice.
3. The Victim Replication (VR) [97] baseline uses the requester's local LLC slice as a victim cache for data that is evicted from the L1 cache. The evicted victims are placed in the local LLC slice only if a line is found that is either invalid, a replica itself or has no sharers in the L1 cache.
4. The Adaptive Selective Replication (ASR) [10] baseline also replicates cache lines in the requester's local LLC slice on an L1 eviction. However, it only allows LLC replication for cache lines that are classified as shared read-only. ASR pays attention to the LLC pressure by basing its replication decision on per-core hardware monitoring circuits that quantify the replication effectiveness based on the benefit (lower LLC hit latency) and cost (higher LLC miss latency) of replication. We do not model the hardware monitoring circuits or
the dynamic adaptation of replication levels. Instead, we run ASR at five different replication levels (0, 0.25, 0.5, 0.75, 1) and choose the one with the lowest
energy-delay product for each benchmark.
6.5.2 Evaluation Metrics
Each multithreaded benchmark is run to completion using the input sets from Table 2.3.
We measure the energy consumption of the memory system including the
on-chip caches, DRAM and the network. We also measure the completion time, i.e.,
the time in the parallel region of the benchmark. This includes the compute latency,
the memory access latency, and the synchronization latency.
The memory access
latency is further broken down into:
1. L1 to LLC replica latency is the time spent by the L1 cache miss request to the LLC replica location and the corresponding reply from the LLC replica, including time spent accessing the LLC.

2. L1 to LLC home latency is the time spent by the L1 cache miss request to the LLC home location and the corresponding reply from the LLC home, including time spent in the network and the first access to the LLC.
3. LLC home waiting time is the queueing delay at the LLC home incurred
because requests to the same cache line must be serialized to ensure memory
consistency.
4. LLC home to sharers latency is the round-trip time needed to invalidate
sharers and receive their acknowledgments. This also includes time spent requesting and receiving synchronous write-backs.
5. LLC home to off-chip memory latency is the time spent accessing memory
including the time spent communicating with the memory controller and the
queueing delay incurred due to finite off-chip bandwidth.
One of the important memory system metrics we track to evaluate our protocol is the breakdown of cache miss types, which are as follows:

1. LLC replica hits are L1 cache misses that hit at the LLC replica location.

2. LLC home hits are L1 cache misses that hit at the LLC home location when routed directly to it, or LLC replica misses that hit at the LLC home location.

3. Off-chip misses are L1 cache misses that are sent to DRAM because the cache line is not present on-chip.
6.6 Results

6.6.1 Comparison of Replication Schemes
Figures 6-8 and 6-9 plot the energy and completion time breakdown for the replication
schemes evaluated. The RT-1, RT-3 and RT-8 bars correspond to the locality-aware
scheme with replication thresholds of 1, 3 and 8 respectively.
The energy and completion time trends can be understood based on the following
3 factors: (1) the type of data accessed at the LLC (instruction, private data, shared
read-only data and shared read-write data), (2) reuse run-length at the LLC, and (3)
working set size of the benchmark. Figure 6-10, which plots how L1 cache misses are handled by the LLC, is also instrumental in understanding these trends.
Many benchmarks (e.g., BARNES) have a working set that fits within the LLC even if replication is done on every L1 cache miss. Hence, all locality-aware schemes (RT-1, RT-3 and RT-8) perform well both in energy and performance. In our experiments,
we observe that BARNES exhibits a high reuse of cache lines at the LLC through
accesses directed at shared read-write data. S-NUCA, R-NUCA and ASR do not
replicate shared read-write data and hence do not observe any benefits with BARNES.
Figure 6-8: Energy breakdown for the LLC replication schemes evaluated. Results are normalized to that of S-NUCA. Note that Average and not Geometric-Mean is plotted here.

VR observes some benefits since it locally replicates read-write data. However, it exhibits higher energy and completion time than the locality-aware protocol for the following two reasons. (1) Its (almost) blind process of creating replicas on all
evictions results in the pollution of the LLC, leading to less space for useful replicas
and LLC home lines. This is evident from lower replica hit rate for VR when compared
to our locality-aware protocol. (2) The exclusive relationship between the L1 cache
and the local LLC slice in VR causes a line to always be written back on an eviction
even if the line is clean. This is because a replica hit always causes the line in the
LLC slice to be invalidated and inserted into the L1 cache. Hence, in the common
case where replication is useful, each hit at the LLC location effectively incurs both
a read and a write at the LLC. And a write expends 1.2x more energy than a read.
Similar trends in VR performance and energy exist in the WATER-NSQ, PATRICIA, BODYTRACK, FACESIM, STREAMCLUSTER and BLACKSCHOLES benchmarks.
Figure 6-9: Completion Time breakdown for the LLC replication schemes evaluated. Results are normalized to that of S-NUCA. Note that Average and not Geometric-Mean is plotted here.

BODYTRACK and FACESIM are similar to BARNES except that their LLC accesses have a greater fraction of instructions and/or shared read-only data. The accesses to
shared read-write data are again mostly reads with only a few writes. R-NUCA shows
significant benefits since it replicates instructions.
ASR shows even higher energy
and performance benefits since it replicates both instructions and shared read-only
data. The locality-aware protocol also shows the same benefits since it replicates all
classes of cache lines, provided they exhibit reuse in accordance with the replication
thresholds. VR shows higher LLC energy for the same reasons as in BARNES. ASR and our locality-aware protocol allow the LLC slice at the replica location to be inclusive of the L1 cache, and hence do not have the same drawback as VR.
Figure 6-10: L1 Cache Miss Type breakdown for the LLC replication schemes evaluated.

VR, however, does not have a performance overhead because the evictions are not on the
critical path of the processor pipeline.
Note that BODYTRACK, FACESIM and RAYTRACE are the only three among the evaluated benchmarks that have a significant L1-I cache MPKI (misses per thousand instructions). All other benchmarks have an extremely low L1-I MPKI (< 0.5) and hence R-NUCA's replication mechanism is not effective in most cases. Even in the
above 3 benchmarks, R-NUCA does not place instructions in the local LLC slice but
replicates them at a cluster level, hence the serialization delays to transfer the cache
lines over the network still need to be paid.
BLACKSCHOLES, on the other hand, exhibits a large number of LLC accesses to
private data and a small number to shared read-only data. Since R-NUCA places
private data in its local LLC slice, it obtains performance and energy improvements
over S-NUCA. However, the improvements obtained are limited since false sharing is
exhibited at the page-level, i.e., multiple cores privately access non-overlapping cache
lines in a page. Since R-NUCA's classification mechanism operates at a page-level, it
is not able to locally place all truly private lines. The locality-aware protocol obtains
improvements over R-NUCA by replicating these cache lines.
ASR only replicates
shared read-only cache lines and identifies these lines by using a per cache-line sticky
Shared bit. Hence, ASR follows the same trends as S-NUCA. DEDUP almost exclusively accesses private data (without any false sharing) and hence, performs optimally
with R-NUCA.
Benchmarks such as RADIX,
FFT, LU-C, OCEAN-C, FLUIDANIMATE
and CONCOMP
do not benefit from replication and hence the baseline R-NUCA performs optimally.
R-NUCA does better than S-NUCA because these benchmarks have significant accesses to thread-private data. ASR, being built on top of S-NUCA, shows the same
trends as S-NUCA. VR, on the other hand, shows higher LLC energy because of the
same reasons outlined earlier. VR's replication of private data in its local LLC slice is
also not as effective as R-NUCA's policy of placing private data locally, especially in
OCEAN-C, FLUIDANIMATE
and CONCOMP whose working sets do not fit in the LLC.
The locality-aware protocol benefits from the optimizations in R-NUCA and tracks
its performance and energy consumption. For the locality-aware protocol, an RT of 3
dominates an RT of 1 in FLUIDANIMATE because it demonstrates significant off-chip
miss rates (as evident from its energy and completion time breakdowns) and hence,
it is essential to balance on-chip locality with off-chip miss rate to achieve the best
energy consumption and performance. While an RT of 1 replicates on every L1 cache
miss, an RT of 3 replicates only if a reuse > 3 is demonstrated. Using an RT of 3
reduces the off-chip miss rate in FLUIDANIMATE and provides the best performance
and energy consumption. Using an RT of 3 also provides the maximum benefit in
benchmarks such as OCEAN-C and OCEAN-NC.
As RT increases, the off-chip miss rate decreases but the LLC hit latency increases.
For example, with an RT of 8, STREAMCLUSTER shows an increased completion time
and network energy caused by repeated fetches of the cache line over the network.
An RT of 3 would bring the cache line into the local LLC slice sooner, avoiding the
unnecessary network traffic and its performance and energy impact. This is evident
from the smaller "L1-To-LLC-Home" component of the completion time breakdown
graph and the higher number of replica hits when using an RT of 3. We explored all
values of RT between 1 & 8 and found that they provide no additional insight beyond
the data points discussed here.
LU-NC exhibits migratory shared data. Such data exhibits exclusive use (both
read and write accesses) by a unique core over a period of time before being handed
to its next accessor. Replication of migratory shared data requires creation of a replica
in an Exclusive coherence state. The locality-aware protocol makes LLC replicas for
such data when sufficient reuse is detected.
Since ASR does not replicate shared
read-write data, it cannot show benefit for benchmarks with migratory shared data.
VR, on the other hand, (almost) blindly replicates on all L1 evictions and performs
on par with the locality-aware protocol for LU-NC.
To summarize, the locality-aware protocol provides better energy consumption
and performance than the other LLC data management schemes.
It is important
to balance the on-chip data locality and off-chip miss rate, and overall, an RT of 3
achieves the best trade-off.
It is also important to replicate all types of data; the selective replication of only certain types by R-NUCA (instructions) and ASR (instructions, shared read-only data) leads to sub-optimal energy and performance.
Overall, the locality-aware protocol has a 16%, 14%, 13% and 21% lower energy and
a 4%, 9%, 6% and 13% lower completion time compared to VR, ASR, R-NUCA and
S-NUCA respectively.
6.6.2 LLC Replacement Policy
As discussed earlier in Section 6.2.1, we propose to use a modified-LRU replacement
policy for the LLC. It first selects cache lines with the least number of sharers and
then chooses the least recently used among them. This replacement policy improves energy consumption over the traditional LRU policy by 15% and 5%, and lowers completion time by 5% and 2%, in the BLACKSCHOLES and FACESIM benchmarks respectively. In all other benchmarks this replacement policy tracks the LRU policy.
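A minimal sketch of the modified-LRU policy described above: among the ways of a set, it first restricts attention to lines with the fewest tracked sharers and then evicts the least recently used among those. The data structure and field names are placeholders, not the simulator's implementation.

```c
#include <limits.h>

struct llc_line {
    int valid;
    int num_sharers;   /* tracked L1 sharers of this line */
    unsigned lru_age;  /* larger value = less recently used */
};

/* Return the way to evict from a set of 'ways' lines:
 * least number of sharers first, then least recently used among those. */
static int pick_victim(const struct llc_line *set, int ways)
{
    int victim = 0;
    int best_sharers = INT_MAX;
    unsigned best_age = 0;

    for (int w = 0; w < ways; w++) {
        if (!set[w].valid)
            return w;                      /* invalid line: free slot, use it */
        if (set[w].num_sharers < best_sharers ||
            (set[w].num_sharers == best_sharers && set[w].lru_age > best_age)) {
            victim = w;
            best_sharers = set[w].num_sharers;
            best_age = set[w].lru_age;
        }
    }
    return victim;
}
```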
6.6.3 Limited Locality Classifier
Figure 6-11: Energy and Completion Time for the Limitedk classifier as a function of the number of tracked sharers (k). The results are normalized to that of the Complete (= Limited64) classifier.
Figure 6-11 plots the energy and completion time of the benchmarks with the Limitedk classifier when k is varied as (1, 3, 5, 7, 64). k = 64 corresponds to the Complete classifier. The results are normalized to that of the Complete classifier. The benchmarks that are not shown are identical to DEDUP, i.e., the completion time and energy stay constant as k varies. The experiments are run with the best RT value of 3 obtained in Section 6.6.1. We observe that the completion time and energy of the Limited3 classifier never exceed those of the Complete classifier by more than 2%, except for STREAMCLUSTER.
With STREAMCLUSTER, the Limited3 classifier starts off new sharers incorrectly
in non-replica mode because of the limited number of cores available for taking the
majority vote. This results in increased communication between the L1 cache and
LLC home location, leading to higher completion time and network energy.
The Limited5 classifier, however, performs as well as the Complete classifier, but incurs an additional 9 KB storage overhead per core when compared to the Limited3 classifier. From the previous section, we observe that the Limited3 classifier performs better than all the other baselines for STREAMCLUSTER. Hence, to trade off the storage overhead of our classifier against the energy and completion time improvements, we chose k = 3 as the default for the limited classifier.
The Limited1 classifier is more unstable than the other classifiers. While it performs better than the Complete classifier for LU-NC, it performs worse for the BARNES and STREAMCLUSTER benchmarks. The better energy consumption in LU-NC is due to the fact that the Limited1 classifier starts off new sharers in replica mode as soon as the first sharer acquires replica status. On the other hand, the Complete classifier has to learn the mode independently for each sharer, leading to a longer training period.
6.6.4 Cluster Size Sensitivity Analysis
Figure 6-12 plots the energy and completion time for the locality-aware protocol when
run using different cluster sizes. The experiment is run with the optimal RT of 3.
Using a cluster size of 1 proved to be optimal. This is due to several reasons.
In benchmarks such as BARNES, STREAMCLUSTER and BODYTRACK, where the
working set fits within the LLC even with replication, moving from a cluster size of
1 to 64 reduced data locality without improving the LLC miss rate, thereby hurting
energy and performance.
Benchmarks like RAYTRACE that contain a significant amount of read-only data
with low degrees of sharing also do not benefit since employing a cluster-based approach reduces data locality without improving LLC miss rate. Employing a clustered
replication policy leads to an equal probability of placing that cache line in the LLC
slice of any core within the cluster. A cluster-based approach can be useful to explore
the trade-off between LLC data locality and miss rate only if the data is shared by
mostly all cores within a cluster.
Figure 6-12: Energy and Completion Time at cluster sizes of 1, 4, 16 and 64 with the locality-aware data replication protocol. A cluster size of 64 is the same as R-NUCA except that it does not even replicate instructions.
In benchmarks such as RADIX and FLUIDANIMATE that show no usefulness for
replication, applying the locality-aware protocol bypasses all replication mechanisms
and hence, employing higher cluster sizes would not be any more useful than employing a lower cluster size. Intelligently deciding which cache lines to replicate using an
RT of 3 was enough to prevent any overheads of replication.
The above reasons along with the added coherence complexity of clustering (as
discussed in Section 6.4) motivate using a cluster size of 1, at least in the 64-core
multicore target that we evaluate.
6.7 Summary
This chapter proposed an intelligent locality-aware data replication scheme for the
last-level cache. The locality is profiled at runtime using a low-overhead yet highly
accurate in-hardware cache-line-level classifier. On a set of parallel benchmarks, the
locality-aware protocol reduces the overall energy by 16%, 14%, 13% and 21% and
the completion time by 4%, 9%, 6% and 13% when compared to the previously
proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA and
Static-NUCA LLC management schemes. The coherence complexity of the described
protocol is almost identical to that of a traditional non-hierarchical (flat) coherence
protocol since replicas are only allowed to be created at the LLC slice of the requesting
core. The classifier is implemented with 14.5KB storage overhead per 256KB LLC
slice.
Chapter 7

Locality-Aware Cache Hierarchy Replication
This chapter combines the private cache (i.e., L1) and LLC (i.e., L2) replication
protocols discussed in Chapters 4 & 6 into a combined cache hierarchy replication
protocol. The combined protocol exploits the advantages of both the private cache
& LLC replication protocols synergistically.
7.1 Motivation
The design of the protocol is motivated by the experimental observation that both L1 & L2 cache replication provide varying performance improvements across benchmarks. Certain benchmarks like TSP & DFS exhibit improvement only with locality-aware L1 replication, certain others like BARNES & RAYTRACE exhibit improvement only with locality-aware L2 replication, while certain benchmarks like BLACKSCHOLES & FACESIM exhibit improvement with both locality-aware L1 & L2 replication. This necessitates the design of a combined L1 & L2 replication protocol that obtains the benefits of both L1 & L2 replication. The combined protocol should be able to provide at least the following 3 data access modes:

1. Replicate line in the L1 cache and access at the L2 home location (this is the default in both locality-aware L1 & L2 replication).

2. Replicate line in both the L1 & L2 caches, to leverage the benefits of locality-aware L2 replication.

3. Do not replicate line in the L1 cache and remote access at the L2 home location (using word-level operations), to leverage the benefits of locality-aware L1 replication.

In addition to the above 3 modes, a 4th mode can be supported for increased efficiency.

4. Do not replicate line in the L1 cache, replicate line in the L2 cache and remote access at the L2 replica location (using word-level operations).

The 4th mode is useful for applications where cache lines do not exhibit much reuse at the L1 cache (since the per-thread working set does not fit within the L1) but exhibit significant reuse at the L2 cache.
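The four access modes can be summarized as a per-line combination of two independent decisions, one for the L1 and one for the L2. The sketch below uses illustrative names for this classification; it is only a mnemonic for the modes above, not the hardware encoding.

```c
/* Combined classification of a cache line for a given core.
 * The L1 decision and the L2 decision are independent, giving four modes. */
enum l1_mode { L1_REPLICATE, L1_REMOTE };
enum l2_mode { L2_REPLICATE, L2_HOME_ONLY };

/* Mode 1: replicate in L1, access L2 at its home     (L1_REPLICATE, L2_HOME_ONLY)
 * Mode 2: replicate in L1 and in the local L2 slice  (L1_REPLICATE, L2_REPLICATE)
 * Mode 3: no replication; remote word access at home (L1_REMOTE,    L2_HOME_ONLY)
 * Mode 4: replicate in L2 only; word access at the
 *         L2 replica                                  (L1_REMOTE,    L2_REPLICATE) */
struct line_mode {
    enum l1_mode l1;
    enum l2_mode l2;
};
```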
Figure 7-1: The four data access modes of the combined protocol. The red block prefers the 1st mode, being replicated only at the L1-I/L1-D cache. The blue block prefers the 2nd mode, being replicated at both the L1 & L2 caches. The violet block prefers the 3rd mode, and is accessed remotely at the L2 home location without being replicated in either the L1 or the L2 cache. The green block prefers the 4th mode, being replicated at the L2 cache and accessed using remote-word requests at the L2 replica location.
These 4 modes of data access are depicted in Figure 7-1. The red block prefers to be in the 1st mode, being replicated only at the L1 cache. The blue block prefers to be in the 2nd mode, being replicated at both the L1 & L2 caches. The violet block prefers to be in the 3rd mode, and is accessed remotely at the L2 home location without being replicated in either the L1 or the L2 cache. And finally, the green block prefers the 4th mode, being replicated at the L2 cache and accessed using word accesses at the L2 replica location.
7.2 Implementation
The combined replication protocol starts out as a private-L1 shared-L2 cache hierarchy. If a cache line shows less or more reuse at the L1 or L2 cache, its classification
is adapted accordingly. We first describe the hardware modifications that need to be
implemented for the functionality of the protocol and later walk through how memory requests are handled. This includes details about how the protocol transitions
between the multiple data access modes shown in Figure 7-1.
7.2.1 Microarchitecture Modifications
Since both locality-aware L1 & L2 replication need to be implemented, each cache line tag at the L2 home location should contain both the L1-Classifier and the L2-Classifier, as shown in Figure 7-2. The L1-Classifier manages replication in the L1 cache while the L2-Classifier manages replication in the L2 cache. Both the L1-Classifier and L2-Classifier are looked up on access to the L2 home and together decide whether the cache line should be replicated at the L1 or L2 or not replicated at all. The L1 & L2 classifiers each contain Mode, Home Reuse, and RAT-Level fields that serve to remember the past locality information for each cache line. (Note that the RAT-Level field was only used for private cache (/L1) replication as introduced in Chapter 4 and was not employed by the LLC (/L2) replication mechanism. The use of the RAT-Level in the combined mechanism will be explained later.)
Figure 7-2: Modifications to the L2 cache line tag. Each cache line is augmented with a Private Reuse counter that tracks the number of times a cache line has been accessed at the L2 replica location. In addition, each cache line tag has classifiers for deciding whether or not to replicate lines in the L1 cache & L2 cache. Both the L1 & L2 classifiers contain the Mode, Home Reuse, and RAT-Level fields that serve to remember the past locality information for each cache line. The above information is only maintained for a limited number of cores, k, and the mode of untracked cores is obtained by a majority vote.

On the other hand, a classifier at the L2 replica location should contain just the L1-Classifier to determine whether the cache line should be replicated at the L1
or not. In addition, the cache tag at the L2 replica location should also contain a
Private Reuse counter that tracks the number of times the cache line at the L2 replica
location has been reused. This counter is communicated back to the L2 home on an
invalidation/eviction to determine future replication decisions.
To maintain hardware simplicity, each L2 cache tag holds all the above described
fields (i.e., the L1-Classifier, L2-Classifier and the Private Reuse counter). The L2-Classifier is not used at the L2 replica location and the Private Reuse counter is not
used at the L2 home location. Together, these hardware structures implement the 4
modes of data access as explained in Section 7.1.
7.2.2 Protocol Operation
We consider read requests first, write requests next and finally, evictions & invalidations.
Read Requests:
At L1 Cache: On a read request (which includes an instruction cache access), the L1 cache is looked up first. If the data is present, it is returned to the compute pipeline and the private reuse counter at the L1 cache is incremented. The private reuse counter tracks the number of times a cache line has been reused at the L1 cache and serves to make future decisions about whether a cache line must be replicated at the L1 cache. It is communicated back on an invalidation or eviction.
At L2 Replica: If the data is not present at the L1 cache, the request is sent to the L2 replica location. If the data is present at the L2 replica location, then the L1-classifier is looked up to get the mode of the cache line. If the mode is private, then a read-only copy of the cache line is transferred back to the requesting core. The line is inserted into the L1 cache with the private reuse counter initialized to '1'. If the mode is remote, then the remote reuse counter (of the L1-classifier) at the L2 replica location is incremented to indicate that the cache line was reused at its remote location. If the remote reuse is > PCT, the sharer is "promoted" to private mode and a cache line copy is handed to it. Else, the requested word is returned to the core. Finally, the private reuse counter at the L2 replica is also incremented to indicate that the cache line has been reused at the L2 replica location.
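The L2-replica read path just described can be summarized as follows. PCT and the counters are taken from the text (Table 7.1 gives PCT = 4), but the function structure, field names, and the exact threshold comparison are only a behavioral sketch, not the hardware state machine.

```c
#include <stdbool.h>

#define PCT 4   /* private caching threshold (Table 7.1) */

struct l1_classifier_entry {
    bool private_mode;   /* true: replicate in the L1; false: remote access */
    int  remote_reuse;   /* reuse observed while in remote mode             */
};

/* Handle a read that hits in the local L2 replica.  Returns true if a
 * read-only copy of the whole line is handed to the requesting L1,
 * false if only the requested word is returned to the core. */
static bool l2_replica_read(struct l1_classifier_entry *cls, int *private_reuse)
{
    (*private_reuse)++;                 /* the line was reused at the L2 replica */

    if (cls->private_mode)
        return true;                    /* hand out an L1 copy */

    cls->remote_reuse++;
    if (cls->remote_reuse >= PCT) {     /* enough reuse: promote to private mode */
        cls->private_mode = true;
        return true;
    }
    return false;                       /* remote-word access only */
}
```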
At L2 Home: If the data is not present at the L2 replica location, the request is forwarded to the L2 home location. If the data is present at the L2 home location, then the directory (co-located with the L2 home) obtains the most recent copy of the cache line from the sharers. Then, both the L1-classifier and the L2-classifier are looked up to get the mode of the cache line.

If the L2-classifier indicates that the mode is Private, the line is sent to the L2 replica location and the mode provided by the L1-classifier at the L2 home is used to initialize the mode in the L1-classifier at the L2 replica location. On the other hand, if the L2-classifier returns a Remote mode, then the remote reuse counter in the L2-classifier is incremented. The messages sent out depend on the mode returned by the L1-classifier.
If the L1-classifier indicates that the mode is Private, then the cache line is sent to the L1 cache. But if the L1-classifier also returns a Remote mode, then the requested word is directly sent to the core. The remote reuse counter of the L1-classifier at the L2 home location is also incremented.
At DRAM Controller: If the data is not present at the L2 home location, the
request is forwarded to the DRAM controller. Once the cache line is returned from
DRAM, the L1-classifier and L2-classifier are initialized such that the cache line tracks
the default private-L1 shared-L2 cache hierarchy.
Write Requests:
At L1 Cache: On a write request, the L1 cache is looked up first. If the data is present, then the word (i.e., the write data) is directly written to the L1 cache. The private reuse counter at the L1 cache is incremented to indicate that the cache line has been reused.
At L2 Replica: If the data is not present at the L1 cache, the request is sent to the L2 replica location. If the data is present at the L2 replica location, then the L1-classifier is looked up to get the mode of the cache line. If the mode is Private, then the line is transferred back to the requesting core. The line is inserted into the L1 cache with the private reuse counter initialized to 1. If the mode is remote, then the remote reuse counter (of the L1-classifier) at the L2 replica location is incremented to indicate that the cache line was reused at its remote location. If the remote reuse is > PCT, the sharer is "promoted" to private mode and a read-write copy of the cache line is handed to it. Else, the word is directly written to the L2 cache. The private reuse counter at the L2 replica is also incremented to indicate that the cache line has been reused at the L2 replica location.
At L2 Home: If the data is not present at the L2 replica location, the request is
forwarded to the L2 home location. If the data is present at the L2 home location,
then the directory (co-located with the L2 home) performs the following actions: (1)
it invalidates all the private sharers of the cache line, and (2) it sets the remote reuse
counters of all its remote sharers to '0'. The L1-classifier and the L2-classifier are
then looked up to get the mode of the cache line.
If the L2-classifier indicates that the mode is Private, a private read-write copy of the line is sent to the L2 replica location. The mode provided by the L1-classifier at the L2 home is used to initialize the mode in the L1-classifier at the L2 replica location. On the other hand, if the L2-classifier returns a Remote mode, then the remote reuse counter in the L2-classifier is incremented. The response sent out now depends on the mode returned by the L1-classifier.

If the L1-classifier indicates that the mode is Private, then a private read-write copy of the cache line is sent to the L1 cache. But if the L1-classifier also returns a Remote mode, then the word is directly written to the L2 cache. The remote reuse counter of the L1-classifier at the L2 home location is also incremented.
At DRAM Controller: If the data is not present at the L2 home location, the
request is forwarded to the DRAM controller. Once the cache line is returned from
DRAM, the L1-classifier is initialized to Private mode while the L2-classifier is initialized to Remote mode. This enables the cache line to be replicated in the L1 cache but not in the L2 cache by default.
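A sketch of the default classification a line receives when it is filled from DRAM, reflecting the private-L1 / shared-L2 starting point described above. The struct layout and field names are illustrative, not the actual tag format.

```c
#include <string.h>

struct classifier {
    int private_mode;   /* 1 = replicate at this level, 0 = remote access */
    int remote_reuse;   /* reuse counter used to decide future promotion  */
    int rat_level;      /* current remote access threshold (RAT) level    */
};

struct l2_home_tag {
    struct classifier l1_cls;   /* governs replication in the L1 cache       */
    struct classifier l2_cls;   /* governs replication in the local L2 slice */
};

/* On a DRAM fill, start as a conventional private-L1 / shared-L2 hierarchy:
 * the line may be replicated in the L1, but not in a local L2 slice. */
static void init_on_dram_fill(struct l2_home_tag *tag)
{
    memset(tag, 0, sizeof *tag);
    tag->l1_cls.private_mode = 1;   /* default: cache in the requester's L1 */
    tag->l2_cls.private_mode = 0;   /* default: access the L2 at its home   */
}
```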
Evictions and Invalidations:
When the cache line is removed from the private L1 cache or the L2 replica location due to an eviction (conflict or capacity miss) or an invalidation (exclusive request by another core), the private reuse counter is communicated to its backing location.

From L1 Cache: On an invalidation response from the L1 cache to the L2 replica / L2 home location, the L1-classifier is looked up to get the remote reuse corresponding to the sharer. If the (private + remote) reuse is > PCT, the line is still allowed to be replicated in the L1 cache. Else, the line is not allowed to be replicated in the L1 cache and the sharer is "demoted" to a remote sharer.

On an eviction response from the L1 cache to the L2 replica / L2 home location, the private reuse is compared to PCT. If private reuse > PCT, the line continues to be replicated in the L1 cache. Else, the line is "demoted" to the status of a remote sharer. The remote access threshold (RAT) is also increased to the next level.
From L2 Replica: On an invalidation response from the L2 replica to the L2 home, the L2-classifier is looked up to get the remote reuse. If the (private + remote) reuse is > RT, the line continues to be replicated in the L2 cache. Else, the line is not allowed to be replicated and the core is "demoted" to a remote sharer.

On an eviction response from the L2 replica to the L2 home location, the private reuse is compared to RT. If private reuse > RT, the line continues to be replicated in the L2 cache. Else, the line is "demoted" to the status of a remote sharer. The remote access threshold (RAT) in the L2-classifier is also increased to the next level.
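The eviction and invalidation responses update the classifiers as sketched below. The thresholds (PCT for the L1 decision, RT for the L2 decision) and the RAT back-off on eviction follow the text; the function structure, names, and exact comparison boundaries are illustrative.

```c
#define PCT 4            /* private caching threshold */
#define RT  3            /* replication threshold     */
#define RAT_MAX_LEVEL 2  /* number of RAT levels (Table 7.1) */

struct cls {
    int private_mode;
    int remote_reuse;
    int rat_level;       /* raising this makes re-promotion harder */
};

/* Update an L1 (use_pct = 1) or L2 (use_pct = 0) classifier when a replica
 * is removed.  'evicted' distinguishes an eviction from an invalidation;
 * 'private_reuse' is the counter returned with the response. */
static void on_replica_removed(struct cls *c, int private_reuse,
                               int evicted, int use_pct)
{
    int threshold = use_pct ? PCT : RT;
    int reuse = evicted ? private_reuse                     /* eviction: private reuse only   */
                        : private_reuse + c->remote_reuse;  /* invalidation: private + remote */

    if (reuse > threshold) {
        c->private_mode = 1;               /* keep replicating */
    } else {
        c->private_mode = 0;               /* demote to remote sharer */
        if (evicted && c->rat_level < RAT_MAX_LEVEL)
            c->rat_level++;                /* back off: harder to re-promote */
    }
}
```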
7.2.3 Optimizations
Remote Access Threshold (RAT): To improve the performance of the L1-classifier and L2-classifier working together, the Remote Access Threshold (RAT) scheme is used (cf. Section 4.3). On an eviction where the next mode is remote, the classifier incrementally raises the threshold for transitions from remote to private mode. This is done so as to make it harder for the core to be promoted to private mode in case it wants to stay in remote mode. This prevents the core from ping-ponging between private & remote modes. With the combined classifier, this also presents either the L1-classifier or the L2-classifier with a fair chance of obtaining the cache line in private mode in case the other one continually classifies the line into remote mode.
Limited Locality Classifier: The Limited3 classifier is used to reduce the overhead
needed to track the mode, remote reuse, and RAT-level counters. These counters are
only tracked for 3 cores and the modes of other cores are obtained using a majority
vote. The management of this limited classifier is done in the same way as described
previously in Chapters 4 & 6.
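A sketch of how a limited classifier might answer a mode query: the k explicitly tracked cores keep their own state, and any other core inherits the majority of the tracked modes. The replacement policy for tracked entries and the tie-breaking rule are assumptions.

```c
#define K_TRACKED 3   /* Limited-3 classifier */

struct tracked_core {
    int valid;
    int core_id;
    int private_mode;   /* tracked mode for this core */
};

/* Return the mode to use for 'core'.  Tracked cores use their own entry;
 * untracked cores take a majority vote of the tracked modes (ties default
 * to private mode here, which is an assumption). */
static int classify_core(const struct tracked_core t[K_TRACKED], int core)
{
    int private_votes = 0, total = 0;

    for (int i = 0; i < K_TRACKED; i++) {
        if (!t[i].valid)
            continue;
        if (t[i].core_id == core)
            return t[i].private_mode;      /* explicitly tracked */
        private_votes += t[i].private_mode;
        total++;
    }
    return (2 * private_votes >= total);   /* majority vote for untracked cores */
}
```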
7.2.4 Overheads
Each L2 cache tag contains a private reuse counter as well as the L1 and L2 classifiers. Assuming a PCT of 4, an RT of 3, the Limited3 classifier, 2 RAT levels and an RATmax of 16, the storage overhead for each cache tag is 2 + 3 x (6 + 2 x (1 + 4 + 1)) = 56 bits. The storage overhead per core for the L2 cache tags is 56 x 2^12 bits = 28 KB.

Each L1 cache tag contains a private reuse counter as well. The storage overhead per core for the L1 cache tags is 2 x 3 x 2^8 bits = 0.19 KB. The timestamp scheme, which serves to implement load speculation while conforming to a particular memory consistency model, incurs a 2.5 KB overhead. Overall, the storage overhead per core is 30.7 KB.
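The per-core numbers above can be reproduced with the following back-of-the-envelope arithmetic. The line counts assume a 256 KB L2 slice and a combined 48 KB of L1-I/L1-D cache with 64-byte lines; the 48 KB figure is an assumption inferred from the 0.19 KB result rather than stated in this chapter.

```c
#include <stdio.h>

int main(void)
{
    /* Per-tag classifier storage at an L2 slice: a 2-bit private reuse counter
     * plus 3 tracked cores, each with a 6-bit core ID and two classifiers of
     * (1 mode + 4 reuse + 1 RAT-level) bits. */
    int bits_per_l2_tag = 2 + 3 * (6 + 2 * (1 + 4 + 1));          /* = 56   */

    int l2_lines = 256 * 1024 / 64;                               /* = 4096 */
    double l2_kb = bits_per_l2_tag * l2_lines / 8.0 / 1024.0;     /* = 28.0 */

    int l1_lines = 48 * 1024 / 64;       /* assumed 48 KB of L1-I + L1-D     */
    double l1_kb = 2.0 * l1_lines / 8.0 / 1024.0;                 /* ~= 0.19 */

    double total_kb = l2_kb + l1_kb + 2.5;   /* + timestamp scheme (2.5 KB)  */
    printf("%d bits/tag, %.1f + %.2f + 2.5 = %.1f KB per core\n",
           bits_per_l2_tag, l2_kb, l1_kb, total_kb);
    return 0;
}
```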
7.3 Evaluation Methodology
We evaluate a 64-core shared memory multicore using out-of-order cores. The default
architectural parameters used for evaluation are shown in Table 2.1. The parameters
specific to the timestamp-based speculation violation detection and the locality-aware
protocols are shown in Table 7.1.
Architectural Parameter                              Value

Locality-Aware Private Cache Replication
  Private Caching Threshold                          PCT = 4
  Max Remote Access Threshold                        RATmax = 16
  Number of RAT Levels                               nRATlevels = 2
  Classifier                                         Limited3

Locality-Aware LLC Replication
  Replication Threshold                              RT = 3
  Max Remote Access Threshold                        RATmax = 16
  Number of RAT Levels                               nRATlevels = 2
  Classifier                                         Limited3

Out-of-Order Core
  Speculation Violation Detection                    Timestamp-based
  Timestamp Width (TW)                               16 bits
  History Retention Period (HRP)                     512 ns
  L1 Load/Store History Queue (L1LHQ/L1SHQ) Size     0.8 KB, 0.4 KB
  L2 Load/Store History Queue (L2LHQ/L2SHQ) Size     0.4 KB, 0.2 KB
  Pending Store History Queue (PSHQ) Size            0.2 KB

Table 7.1: Locality-aware protocol & timestamp-based speculation violation detection parameters.
7.4 Results
In this section, the following four schemes are compared.
1. RNUCA: Reactive-NUCA is the baseline scheme that implements the data
placement and migration techniques of R-NUCA (basically, the locality-aware
protocol with a PCT of 1).
2. L1: Locality-aware L1 (/private cache) replication with a PCT of 4.

3. L2: Locality-aware L2 (/LLC) replication with an RT of 3.
4. L1+L2: Locality-aware L1+L2 (cache hierarchy) replication with a PCT of 4
and RT of 3.
Figures 7-3 and 7-4 plot the completion time and energy obtained when using the
above 4 design alternatives.
Completion Time:
We observe that the L1 + L2 scheme, in general, tracks the best of locality-aware L1 or L2 replication. For example, L1 + L2 tracks the performance of L1 in benchmarks such as CONCOMP and TSP, and the performance of L2 in benchmarks such as PATRICIA and LU-NC. In the FACESIM benchmark, where both L1 & L2 provide benefits, the L1 + L2 scheme improves upon the benefits provided by the two protocols. This is because L1 + L2 possesses the functionality of both the L1 and L2 schemes and can adaptively decide the best mode for a line.

Only in two benchmarks, OCEAN-NC & STREAMCLUSTER, does L1 + L2 perform worse than the best of L1 and L2. This is due to load speculation violations created by the timestamp scheme. These load speculation violations arise in the critical section of the application, and hence, they increase the synchronization time as well. Note that in RNUCA and the L2 scheme, cache lines are always replicated in the L1 cache, hence invalidation/update requests can be relied on to detect load speculation violations. We do not model violations due to invalidation requests in our simulator, so the performance provided by RNUCA and the L2 scheme is an upper bound.
Figure 7-3: Completion Time breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here.

Overall, the L1, L2 and L1 + L2 schemes improve completion time by 13%, 10% and 15% respectively compared to the RNUCA baseline.

Energy Consumption:

As with completion time, the L1 + L2 scheme in general tracks the scheme that performs best.
Figure 7-4: Energy breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here.

The L1 + L2 scheme tracks the energy of the L1 scheme in benchmarks such as RADIX, DIJKSTRA-AP & DFS, and the energy of the L2 scheme in benchmarks such as BARNES, BLACKSCHOLES and SWAPTIONS.
In benchmarks such as RAYTRACE, FACESIM, STREAMCLUSTER, BODYTRACK, and PATRICIA, the L1 + L2 scheme has lower energy consumption than both the L1 and L2 schemes due to its capability to select the best of the two at cache line granularity based on the reuse of the line. In addition, the L1 + L2 scheme also introduces a 4th mode which allows cache lines to be replicated in the L2 but not in the L1. This allows more efficient access to cache lines that have a higher reuse distance.
In benchmarks such as VOLREND and CHOLESKY, whose energy is dominated by the L1-D cache (since there are only a few L1 cache misses), both the L1 and the L1 + L2 schemes have to incur the energy overhead of accessing the L1 and L2 history queues on every cache access, and this increases their overall energy consumption. In such benchmarks, the L1 + L2 scheme is only able to perform as well as the L1 scheme. In other benchmarks such as WATER-NSQ and LU-NC, the L1 + L2 scheme gets all the network energy benefits of the L2 scheme, but incurs the overhead of accessing the history queues.
In the CONCOMP benchmark, most cache lines prefer not to be replicated in the L1 cache and to be accessed remotely at the L2 home location due to almost no reuse at the L1 cache (an L1 cache miss rate of 42%). The L1+L2 scheme pays an occasional overhead of placing these cache lines in the L2 replica location in order to learn whether they are reused at the L2. This overhead increases the network & DRAM energy by a small amount due to additional evictions from the L2 cache. In the FLUIDANIMATE benchmark, the modes of cache lines change extremely frequently over time, thereby causing the locality-aware coherence protocols (which are based on the immediate past history) to incur false classifications. This increases the network and L2 cache energy consumption over the RNUCA baseline.

Overall, the L1, L2 and L1 + L2 schemes improve energy by 15%, 15% and 22% respectively compared to the RNUCA baseline.
7.4.1 PCT and RT Threshold Sweep
In this section, the PCT and RT parameters of the L1 + L2 scheme are varied from
1 to 8 and the resulting completion time and energy are plotted in Figures 7-5 and 7-6.
We observe that the completion time & energy are high at low values (i.e., 1,2) of
PCT & RT. This is due to the network traffic overheads, low cache utilization, and
the resultant processor stalls incurred due to replicating low-reuse cache lines in the L1 & L2 caches.

Figure 7-5: Variation of Completion Time as a function of PCT & RT. The Geometric-Mean of the completion time obtained from all benchmarks is plotted.

Figure 7-6: Variation of Energy as a function of PCT & RT. The Geometric-Mean of the energy obtained from all benchmarks is plotted.

As PCT & RT increase to mid-range values (i.e., to 3, 4, 5), both the
completion time & energy consumption reduce drastically.
After that, completion
time & energy increase gradually.
A PCT of 4 and an RT of 3 are selected because they provide the best Energy x
Delay product among the possible <PCT,RT> combinations. If the best <PCT,RT>
combination is selected for each benchmark separately, the completion time and energy are only improved by 2% compared to using a PCT of 4 and an RT of 3 for all
benchmarks. This justifies a static selection for PCT & RT.
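The static selection of <PCT, RT> described above amounts to minimizing the energy-delay product over the sweep. The sketch below shows that selection step; the arrays are placeholders for the measured, baseline-normalized energy and completion time values.

```c
#define MAX_T 8

/* energy[p][r] and delay[p][r] hold the geometric-mean energy and completion
 * time (normalized to the baseline) measured for PCT = p+1 and RT = r+1. */
void pick_best(const double energy[MAX_T][MAX_T],
               const double delay[MAX_T][MAX_T],
               int *best_pct, int *best_rt)
{
    double best_edp = 1e30;
    for (int p = 0; p < MAX_T; p++) {
        for (int r = 0; r < MAX_T; r++) {
            double edp = energy[p][r] * delay[p][r];   /* energy x delay product */
            if (edp < best_edp) {
                best_edp = edp;
                *best_pct = p + 1;
                *best_rt  = r + 1;
            }
        }
    }
}
```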
7.5 Summary
This chapter combines the private cache (i.e., L1) and LLC (i.e., L2) replication schemes discussed in Chapters 4 & 6 into a combined cache hierarchy replication scheme. The combined scheme exploits the advantages of both the private cache & LLC replication protocols synergistically. Overall, evaluations on a 64-core multicore
processor show that locality-aware cache hierarchy replication improves completion
time by 15% and energy by 22% compared to the Reactive-NUCA baseline and can
be implemented with a 30.7 KB storage overhead per core.
Chapter 8
Related Work
Previous research on cache hierarchy organizations & implementation of memory
consistency models in multicore processors can be discussed based on the following
eight criteria.
1. Data replication
2. Coherence directory organization
3. Selective caching / dead block eviction

4. Remote Access

5. Data placement and migration

6. Cache replacement policy

7. Cache partitioning / cooperative cache management
8. Memory Consistency Models
8.1 Data Replication
Previous research on data replication in multicore processors mainly focused on the
last level cache. All other cache levels have traditionally been organized as private to
a core and hence data can be replicated in them based on demand without any additional control strategy. Last level caches (LLCs) have been organized as private
[26],
shared [2] or a combination of both [97, 24, 10, 39].
The benefits of having a private or shared LLC organization depend on the degree
of sharing in an application as well as data access patterns. While private LLC organizations have low hit latencies, their off-chip miss rates are high in applications that
exhibit high degrees of sharing due to cache line replication. Shared LLC organizations, on the other hand, have high hit latencies since each request has to complete a
round-trip over the interconnection network. This hit latency increases as more cores
are added since the diameter of practically feasible on-chip networks increases with
the number of cores. However, their off-chip miss rates are low since cache lines are
not replicated.
Both private and shared LLC organizations incur significant protocol latencies
when a writer of a cache block invalidates multiple readers; the impact being directly
proportional to the degree of sharing of the cache block. (Note that processors with
shared LLC organizations typically have private lower-level caches).
Four recently proposed hybrid LLC organizations that combine the good characteristics of private and shared LLC organizations are CMP-NuRAPID [24], Victim
Replication [97], Adaptive Selective Replication [10], and Reactive-NUCA [39].
CMP-NuRAPID (Non-Uniform access with Replacement And Placement usIng
Distance associativity) [24] uses Controlled Replication to place data so as to optimize
the distance to the cache bank holding the data. The idea is to decouple the tag and
data arrays and maintain private per-core tag arrays and a shared data array. The
shared data array is divided into multiple banks based on distance from each core.
A cache line is replicated in the cache bank closest to the requesting core on its
second access, the second access being detected using the entry in the tag array. This
scheme does not scale with the number of cores since each private per-core tag array
potentially has to store pointers to the entire data array. Results indicate that the
private tag array size used in CMP-NuRAPID should only be twice the size of the
per-cache bank tag array but this is because only a 4-core CMP is evaluated.
In
addition, CMP-NuRAPID requires snooping coherence to invalidate replicas as well
as additional cache controller transient states for ensuring the correct ordering of
invalidations and read accesses.
Victim Replication (VR) starts out with a private-L1 shared-L2 organization and uses the local L2 slice as a victim cache for data that is evicted from the L1 cache. The eviction victims are placed in the L2 slice only if a line is found that is either invalid, a replica itself or has no sharers in the L1 cache. By only replicating the L1 capacity victims, this scheme attempts to combine the low hit latency of private LLCs with the low off-chip miss rates of shared LLCs. However, this strategy blindly replicates all L1 capacity victims without paying attention to cache pressure.
Adaptive Selective Replication (ASR) operates similarly to Victim Replication by replicating cache lines in the local L2 slice on an L1 eviction.
However, it pays
attention to cache pressure by basing its replication decision on a probability. The
probability value is picked from discrete replication levels on a per-cache basis. A
higher replication level indicates that Li eviction victims are replicated with a higher
probability.
The replication levels are decided dynamically based on the cost and
benefit of replication. When operating at a particular replication level, ASR estimates
the cost and benefit of increasing or decreasing the level using 4 hardware monitoring
circuits.
Both VR and ASR have the following three drawbacks.
1. L2 replicas are allocated without paying attention to whether they will be referenced in the near future. Applications with a huge working set or those with
a high proportion of compulsory misses (streaming workloads) are adversely
affected by this strategy.
Applications with cache lines that are immediately
invalidated without additional references from the L1 do not benefit as well.
2. If the data is not found in the local L2 slice, the read request has to be sent to
the home L2 cache. The capacity of the L2 cache is not shared amongst neighboring cores.
Applications that widely share instructions and data (through
read accesses) are not well served by pinning the replication location to the
local L2 slice. Sharing a replication location amongst a cluster of cores would
have been the optimal strategy.
3. The L2 cache slices have to always be searched and invalidated along with the L1
cache. Although the performance overhead is small, it increases the complexity
of the protocol.
Reactive-NUCA replicates instructions in one LLC slice per cluster of 4 cores using rotational interleaving. Data is never replicated and is always placed at a single LLC slice. The one-size-fits-all approach to handling instructions does not work
for applications with heterogeneous instructions nor does it work for applications
where the optimal cluster size for replication is not 4. In addition, the experiments
in Chapter 6 show significant opportunities for improvement through replication of
shared read-only and shared read-write data that is frequently read and sparsely
written.
The locality-aware data replication scheme discussed in this thesis does not suffer
from the limitations of the above mentioned schemes. It only replicates cache lines
that show reuse at the LLC, bypasses replication mechanisms for cache lines that
do not exhibit reuse and adapts the cluster size according to application needs to
optimize performance and energy consumption. The cache lines that are replicated
are purely those that exhibit reuse and include instructions, shared read-only and
shared read-write data. No coarse-grain classification decisions guide the replication
process.
In addition to the above mentioned drawbacks, all the schemes discussed leave the
private caches unmanaged.
A request for data allocates a cache line in the private
cache hierarchy even if the data has no spatial or temporal locality. This leads to cache
pollution since such low locality cache lines can displace more frequently used data.
The locality-aware coherence protocol discussed in this thesis focuses on intelligent
management of private caches. Managing private caches is important because (1) they
are generally capacity-stressed due to strict size and latency limitations, and (2) they
replicate shared data without paying any attention to its locality.
8.2 Coherence Directory Organization
Several proposals have been made for scalable directory organizations.
Techniques
include reducing the size of each directory entry as well as increasing the scalability
of the structure that stores directory entries.
Hierarchical directory organizations [64] enable area-efficient uncompressed vector storage through multiple serialized lookups. However, hierarchical organizations
impose additional lookups on the critical path, hurting latency and increasing complexity. Limited directories [6] have been proposed that invalidate cache lines so as to
maintain a constant number of sharers. Such schemes hurt cache lines that are widely
read-shared. Limited directories with software support [21] remove this restriction but
require OS-support.
Chained directories [20] maintain a linked list of sharers (one
L1 cache tag pointing to another) but are complex to implement and verify due to
distributed linked list operations. Coarse vectors [37] maintain sharing information
per cluster of cores and hence, cause higher network traffic and complexity.
Duplicate-Tag directory [9, 88] reduces storage space but requires an energy-intensive associative search to retrieve sharing information. This associativity increases as more cores are added. Tagless directory
This associativity in-
[96] removes the energy-inefficient
associative lookup of the Duplicate-Tag organization using Bloom filters to represent
the sharers.
However, it adds a lot of extra complexity since false positives (shar-
ers marked present at the directory but not actually present) need to be handled
correctly. Moreover, Tagless requires extra lookup and computation circuitry during
eviction and invalidation to reset the relevant bloom filter bits.
Sparse directory schemes [37, 70] organize the directory like a set-associative cache
with low associativity and are more power-efficient than Duplicate-Tag directories.
But they incur directory-induced back-invalidations when some sets are more heavily
accessed than others. Hence, considerable area cost is expended in over-provisioning
the directory capacity to avoid set conflicts. Cuckoo directory [32] avoids these set
conflicts using an N-ary Cuckoo Hash Table with different hash functions for each
way. Unlike a regular set-associative organization that always picks a replacement
victim from a small set of conflicting entries, the Cuckoo directory displaces victims to
alternate non-conflicting ways, resorting to eviction only in exceptional circumstances.
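The displacement idea can be illustrated with a simple software analogue of cuckoo-style insertion; the way count, hash mixing constants and displacement bound below are hypothetical and are not taken from the Cuckoo directory paper.

```cpp
#include <array>
#include <cstdint>
#include <optional>
#include <utility>

// Illustrative cuckoo-style directory insertion (not the exact Cuckoo
// directory design): W ways, each indexed by a different hash of the tag.
// On a conflict, the victim is relocated to one of its alternate ways
// instead of being evicted; eviction happens only after a bounded number
// of displacements.
constexpr int kWays = 4;
constexpr int kSetsPerWay = 256;
constexpr int kMaxDisplacements = 8;   // hypothetical bound

struct Entry { uint64_t tag = 0; bool valid = false; };
using Way = std::array<Entry, kSetsPerWay>;

inline int hashWay(int way, uint64_t tag) {
    // A different mixing constant per way stands in for per-way hash functions.
    static const uint64_t mix[kWays] = {0x9E3779B1, 0x85EBCA77, 0xC2B2AE3D, 0x27D4EB2F};
    return static_cast<int>((tag * mix[way]) % kSetsPerWay);
}

// Returns the evicted tag if insertion finally had to evict, nullopt otherwise.
std::optional<uint64_t> cuckooInsert(std::array<Way, kWays> &dir, uint64_t tag) {
    uint64_t cur = tag;
    int way = 0;
    for (int i = 0; i < kMaxDisplacements; ++i, way = (way + 1) % kWays) {
        Entry &slot = dir[way][hashWay(way, cur)];
        if (!slot.valid) { slot = {cur, true}; return std::nullopt; }
        std::swap(cur, slot.tag);   // displace the victim to its alternate way
    }
    return cur;                      // rare case: give up and evict
}
```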
Scalable Coherence Directory (SCD) [80] introduces a variable sharer set representation to store the sharers of a cache line. While a cache line with a few sharers uses
a single directory tag, widely shared cache lines use multiple tags. SCD operates like
a limited directory protocol when the sharers can be tracked with the single directory
tag. As the number of sharers grows, it switches to hierarchical sharer tracking with
the root tag tracking sharing at a cluster level and the leaf tags tracking the sharing
within each cluster. SCD uses Cuckoo directories/ZCache [78] for high associativity.
In-cache directory [18, 13] avoids the overhead of adding a separate directory structure as in Sparse directories but is area-inefficient because the lower-level caches (L1)
are much smaller than the higher-level ones (L2). However, in-cache directories do
not suffer from back-invalidations. For a CMP, in-cache directories are only practical
with at least one shared cache in the cache hierarchy.
SPACE [99] (Sharing pattern-based directory coherence) exploits sharing pattern
commonality to reduce directory storage. If multiple cache lines are shared by the
same set of cores, SPACE allocates just one sharing pattern for them. However, these
sharing patterns must be replicated if the directory is distributed among multiple cores
(which needs to be done in modern CMPs to avoid excessive contention). SPATL [100]
(Sharing-PAttern based TagLess directory) decouples the sharing patterns from the
bloom filters of Tagless and eliminates the redundant copies of sharing patterns.
Although this results in improved directory storage over Tagless and SPACE, the
additional complexity of Tagless remains.
In-Network Cache Coherence [29] embeds directory information in network routers
so as to remove directory indirection delays and fetch cache lines from the closest
core. The directory information within the network routers is organized as a tree and
hence requires a lot of extra complexity and latency during writes to clean up the tree.
Additionally, each router access becomes more expensive.
The schemes described above are either inefficient, increase the directory organization complexity significantly by using a compressed representation of the sharers, or
require complex on-chip network capabilities. The ACKwise [56, 57] directory coherence protocol, on the other hand, uses a simple sharer representation with a limited
directory and relies on simple changes to an electrical mesh network for broadcasts.
Since the number of sharers is always tracked, acknowledgements are efficient.
In addition, the locality-aware cache coherence protocol reduces the number of invalidations to cache lines with low spatio-temporal locality, making ACKwise more efficient.
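The following sketch illustrates the flavor of an ACKwise-style directory entry as described above: up to k sharers are tracked precisely, and once the pointers overflow the entry falls back to broadcast invalidations while still counting exactly how many acknowledgements to expect. The field names and template parameter are illustrative, not the exact hardware encoding.

```cpp
#include <array>
#include <cstdint>

// Minimal sketch of an ACKwise_k-style directory entry: up to K sharer
// pointers are tracked precisely; beyond that the entry switches to a
// "global" mode where invalidations are broadcast, but the number of
// sharers is still counted so exactly that many acknowledgements are awaited.
template <int K>
struct AckwiseEntry {
    std::array<uint16_t, K> sharer;   // precise pointers while numSharers <= K
    uint16_t numSharers = 0;          // always exact, even in global mode
    bool global = false;              // true once the pointers overflow

    void addSharer(uint16_t core) {
        if (!global && numSharers < K) sharer[numSharers] = core;
        else global = true;           // stop tracking identities, keep the count
        ++numSharers;
    }

    // On a write: either send targeted invalidations or a broadcast, and in
    // both cases wait for exactly numSharers acknowledgements.
    int invalidationsToAwait() const { return numSharers; }
    bool useBroadcast() const { return global; }
};
```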
The locality-aware data replication scheme tries to incorporate the benefits of hierarchical directories (i.e., hierarchical sharer tracking and data locality) by replicating
those cache lines with high utility at the L2 cache. In addition, it avoids the pitfalls of
hierarchical directories by not performing serial lookups or hierarchical invalidations
for cache lines that do not benefit from replication.
8.3 Selective Caching / Dead-Block Eviction
Several proposals have been made for selective caching in the context of uniprocessors.
Selective caching, a.k.a. cache bypassing, avoids placing the fetched cache line in the
cache and either discards it or places it in a smaller temporary buffer so as to improve
cache utilization. Most previous works have explored selective caching in the context
of prefetching. Another related body of work is dead-block eviction. Dead-blocks are
cache lines that will not be reused in the future and hence may be removed so as to
improve prefetching or reduce cache leakage energy.
McFarling [67] proposed dynamic exclusion to reduce conflict misses in a direct-mapped
instruction cache. Stream Buffers [48] place the prefetched data into a set-associative
buffer to avoid cache pollution. Abraham et al. [5] show that fewer than 10
instructions account for half the cache misses for six out of nine SPEC89 benchmarks.
Tyson et al. [89] use this information to avoid caching the data accessed by such
instructions so as to improve cache utilization. Gonzalez et al. [35] propose a dual data
cache with independent parts for managing data with spatial and temporal locality.
They also implement a lazy caching policy which tries not to cache anything until a
benefit (in terms of spatial or temporal locality) can be predicted. The locality of a
data reference is predicted using a program counter (PC) based prediction table.
Lai et al. [58] propose dead-block predictors (DBPs) that use a trace of memory
references to predict when a block in a data cache becomes evictable.
They also
propose dead-block correlating prefetchers (DBCPs) that use address correlation in
conjunction with dead-block traces to predict a subsequent address to prefetch. Liu
et al. [61] increase dead-block prediction accuracy by predicting dead blocks based on
bursts of accesses to a cache block. These cache bursts are more predictable since
they hide the irregularity of individual references. The best performance optimizing
strategy is to then replace these dead blocks with prefetched blocks.
While the locality-aware coherence protocol is orthogonal to the dead-block eviction techniques, it differs from the above proposed selective caching techniques in the
following ways:
1. In prior selective caching schemes, the referenced data is always brought into the
core but is then placed in a set-associative buffer or discarded. On the contrary,
the locality-aware protocol selectively decides to move a cache line from the
shared LLC to the private cache or simply accesses the requested word at the
shared LLC, thereby reducing the energy consumption of moving unnecessary
data through the on-chip network.
2. All prior selective caching proposals have focused on uniprocessors, while this
thesis targets large-scale shared memory multicores running multithreaded applications with shared data.
In addition to the traditional private data, the
locality-aware protocol also tracks the locality of shared data and potentially
converts expensive invalidations into much cheaper word accesses that not only
improve memory latency but also reduce the energy consumption of the network and cache resources.
3. Prior schemes use a program counter (PC) based prediction strategy for selective caching. This may lead to inaccurate classifications and is insufficient for
programs that want to cache a subset of their working set. On the
contrary, the locality-aware protocol works at the fine granularity of cache lines
and thereby avoids the above shortcomings.
8.4 Remote Access
Remote access has been used as the sole mechanism to support shared memory in
multicores [92, 41, 31]. Remote access in coordination with intelligent cache placement [31] has been proposed as an alternative to cache coherence.
Remote store programming (RSP) has been used to increase the performance of certain
HPC applications [41] by placing data at the consumers. This is beneficial because processor
performance is more sensitive to load latency than store latency (since stores can be
hidden using the store queue). Software-controlled asynchronous remote stores [63]
have been proposed to reduce the communication overhead of synchronization variables
such as locks, barriers and presence flags.
Recently, research proposals that utilize remote access as an auxiliary mechanism
[71, 55] have demonstrated improvements in performance and energy consumption.
In this thesis, we observe that complex cores that support popular memory models
(e.g., x86 TSO, ARM, SC) need novel mechanisms to benefit from these adaptive
protocols.
8.5 Data Placement and Migration
Previous works have tackled data placement and migration either using compiler
techniques or a combination of micro-architectural and operating system techniques.
Yemliha et al. [95, 98] discuss compiler-directed code and data placement algo-
rithms that build a CDAG (Code Data Affinity Graph) offline and use the edge
weights of the graph to perform placement. These techniques are limited to benchmarks whose access patterns can be determined statically and cannot accommodate
dynamic behavior.
Kim et al. [50] first proposed non-uniform caches (NUCA) for uniprocessors. They
propose two variants: Static-NUCA (SNUCA) and Dynamic-NUCA (DNUCA). While
SNUCA statically interleaves cache blocks across banks, DNUCA migrates high locality blocks to the bank closest to the cache controller. DNUCA uses parallel multicast
or sequential search to access data starting from the closest bank to the farthest
bank. Chishti et al [23] propose NuRAPID (Non-Uniform access with Replacement
And Placement usIng Distance associativity) to reduce the energy consumption in
NUCA. It decouples the tag array from the data array, using a central tag array to
place the frequently accessed data in the fastest subarrays, with fewer swaps than
DNUCA, while placing rarely accessed data in the farthest subarrays. It exploits the
fact that tag and data access are performed sequentially in a large last-level cache.
Beckmann et al [11] propose CMP-SNUCA and CMP-DNUCA that extend these
cache architectures to CMPs.
They observe that most of the L2 hits are made to
shared blocks, and hence block migration in CMP-DNUCA does not prove to be
very useful since it ends up placing the shared block at the center of gravity of the
requesters (not close to any requester). In addition, block migration necessitates a
2-phase multicast search algorithm to locate the L2 cache bank that holds a particular
cache block. This search algorithm degrades the performance of CMP-DNUCA even
below that of CMP-SNUCA in many applications.
The authors also evaluate the
benefits of strided prefetching with CMP-SNUCA (static interleaving) and find that
to be greater than that of block migration with CMP-DNUCA.
CMP-NuRAPID [24] (discussed earlier) uses Controlled Replication to force just
one cache line copy for read-write shared data and forces write-through to maintain
coherence for such lines using a special Communication (C) coherence state.
The
drawback again is that the size of the per-core private tag arrays does not scale with the
number of cores.
Cho et al. [25] propose using page-table and TLB entries to perform data placement. However, they leave intelligent page allocation
policies to future work. Reactive-NUCA [39] uses the above technique to place private data local
to the requesting core and statically interleave shared data in different slices of the
shared L2 cache. The only overhead is that when the classification of a page turns
from private to shared, the existing cache copies of the page at its first requester have
to be purged.
Dynamic directories [28] adapt the same placement and migration policies of
Reactive-NUCA for directory entries with a few modifications.
Directory entries
are not placed for private data. For shared data, directory entries are statically interleaved using a hash function of the address.
In this thesis, the placement and migration algorithms of Reactive-NUCA are
reused. However, shared data and instructions may be replicated in an L2 cache slice
close to the requester as described earlier.
8.6 Cache Replacement Policy
Qureshi et al [75] divide the problem of cache replacement into two parts: victim
selection policy and insertion policy. The victim selection policy decides which line
gets evicted for storing an incoming line, whereas the insertion policy decides where
in the replacement list the incoming line is placed. Their proposed replacement policy
is motivated by the observation that when the working set is greater than the cache
size, the traditional LRU replacement policy causes all installed lines to have poor
temporal locality.
The optimal replacement policy should retain some fraction of
the working set long enough so that at least that fraction of the set provides cache
hits. They propose the Bimodal Insertion Policy which inserts the majority of the
incoming cache lines in the LRU position and a few in the MRU position. This reduces
cache thrashing in workloads where the working set is greater than the cache size. The
incoming lines that are placed in the LRU position get promoted to the MRU position
only if they get re-referenced. They also propose the Dynamic Insertion Policy which
switches between the Bimodal Insertion Policy (for LRU-unfriendly workloads) and
the traditional LRU policy (for LRU-friendly workloads) at the application level based
on dynamic behavior using Set Dueling Monitors (SDMs). The victim selection policy
still remains the same as in the traditional LRU scheme.
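A software analogue of the Bimodal Insertion Policy for a single set is sketched below; the 1/32 throttle and the list-based recency stack are illustrative simplifications of the hardware scheme described in [75].

```cpp
#include <cstdint>
#include <cstdlib>
#include <list>

// Sketch of the Bimodal Insertion Policy for one cache set: victims are still
// chosen from the LRU end, but an incoming line is inserted at the MRU
// position only with a small probability (the "bimodal throttle"); otherwise
// it is inserted at LRU and is promoted to MRU only if it is re-referenced.
struct BipSet {
    std::list<uint64_t> lru;          // front = MRU, back = LRU
    std::size_t ways;
    explicit BipSet(std::size_t w) : ways(w) {}

    void access(uint64_t tag) {
        for (auto it = lru.begin(); it != lru.end(); ++it) {
            if (*it == tag) {                 // hit: promote to MRU
                lru.erase(it);
                lru.push_front(tag);
                return;
            }
        }
        if (lru.size() == ways) lru.pop_back();          // evict from LRU end
        if (std::rand() % 32 == 0) lru.push_front(tag);  // rare MRU insertion
        else lru.push_back(tag);                         // common LRU insertion
    }
};
```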
Jaleel et al. [45] modify the above policy for multiprogrammed workloads and pro-
pose Thread Aware Dynamic Insertion Policy (TADIP) which is cognizant of multiple
workloads running concurrently and accessing the same shared cache. They propose
using the Bimodal Insertion Policy for streaming applications as well as cache-thrashing applications, while using the traditional LRU Insertion Policy for recency-friendly
applications.
Jaleel et al. [46] further propose cache replacement using re-reference interval pre-
diction (RRIP) to make a cache scan-resistant in addition to preserving its thrash-resistant
property using TADIP. Scans are access patterns where a burst of references
to non-temporal data discards the active working set from the cache. RRIP operates
using an M-bit register per cache block to store its re-reference interval prediction
value (RRPV) in order to adapt its replacement policy for data cached by scans.
This thesis applies different techniques to remotely cache low-locality data that
participates in thrashing and scans at the shared LLC location. Controlled replication
is also used to achieve the optimal data locality and miss rate for the L2 cache.
8.7 Cache Partitioning / Cooperative Cache Management
With the introduction of multicore processors, it is now possible to execute multiple
applications concurrently on the same processor. This makes last-level cache (LLC)
management complicated because these applications may have varying cache demands.
While shared caches enable effective capacity sharing, i.e., applications can use cache
capacity based on demand, private caches constrain the capacity available to an application to the size of the cache on the core on which it is executing. This is bad for concurrently
executing applications that have varying working set sizes. On the other hand, private caches inherently provide performance isolation, so a badly behaving application
cannot hurt the performance of other concurrently executing applications.
Several
recent proposals exist that combine the benefit of shared and private caches. While
cache partitioning schemes [76, 79] divide up the shared cache among multiple ap-
plications so as to optimize throughput and fairness, co-operative cache management
schemes [22, 74, 40, 77] start with a private cache organization and transfer evicted
lines to other private caches (spilling) to exploit cache capacity.
Utility-based cache partitioning [76] uses way-partitioning and allocates a varying
number of ways of a cache to different applications. The number of ways allocated
is decided by measuring the miss rate of an application as a function of the number
of ways allocated to it. This measurement is done using hardware monitoring circuits
based on dynamic set sampling. However, way partitioning decreases the cache associativity available to an application and is only useful when the number of ways of
the cache is greater than the number of partitions. Vantage [79] removes the inefficiencies
of way-partitioning using a highly-associative cache (ZCache [78]) and relying
on statistical analysis to provide strong guarantees and bounds on associativity and
sizing. Also, they improve performance by partitioning only 90% of the cache while
retaining 10% unpartitioned for dynamic application needs.
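The way-allocation step can be illustrated with a simple greedy loop over per-application miss curves (the kind of curves a set-sampling monitor would estimate); this is a stand-in for the lookahead algorithm of [76], assumes monotonically non-increasing miss curves with at least totalWays + 1 points each, and uses hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Sketch of utility-based way allocation: given each application's miss count
// as a function of allocated ways, hand out ways one at a time to whichever
// application saves the most misses from one extra way.
std::vector<int> allocateWays(const std::vector<std::vector<uint64_t>> &missCurve,
                              int totalWays) {
    const int nApps = static_cast<int>(missCurve.size());
    std::vector<int> ways(nApps, 0);
    for (int w = 0; w < totalWays; ++w) {
        int best = 0;
        uint64_t bestGain = 0;
        for (int a = 0; a < nApps; ++a) {
            // Misses saved by giving application 'a' one more way
            // (assumes missCurve[a] is non-increasing in the way count).
            uint64_t gain = missCurve[a][ways[a]] - missCurve[a][ways[a] + 1];
            if (gain >= bestGain) { bestGain = gain; best = a; }
        }
        ++ways[best];
    }
    return ways;   // ways[a] = number of LLC ways assigned to application a
}
```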
Cooperative caching [22] starts with a private LLC baseline to minimize hit latency
but spills evicted lines into the caches of adjacent cores to enable better capacity
sharing. However, this scheme does not take into account the utility of the spilled
cache lines and can work very badly with streaming applications as well as applications
with huge working set sizes. Adaptive spill-receive [74] marks each private cache as a
spiller or receiver (which receives spilled cache lines). The configuration of a cache is
decided based on set dueling. Set dueling allocates a few sets of each cache to always
spill or always receive and calculates the cache miss rate incurred by these sets. The
rest of the sets are marked as spill or receive based on which does better. This scheme
does not scale beyond a few cores because of the inherent bottlenecks in set dueling.
Adaptive set-granular cooperative caching [77] marks each set of a cache as spill or
receiver or neutral, thus enabling cooperative caching at a very fine-grained level. A
set marked as neutral does not spill lines or receive spilled lines. The cache to spill to
is decided once per set using broadcast and its location is recorded locally for future
spills. However, this scheme cannot allocate more than twice the private cache size
to an application.
The locality-aware adaptive cache coherence and data replication techniques are
orthogonal to the above proposals.
However, for data replication, the monitoring
circuit used by utility-based cache partitioning is adapted in one of the methods to
measure the miss rate incurred by alternate replication cluster sizes.
8.8 Memory Consistency Models
Speculatively relaxing memory order was proposed with load speculation in the MIPS
R10K [94]. Full speculation for SC execution has been studied as well [34, 84]. InvisiFence [15],
Store-Wait-Free [91] and BulkSC [19] are proposals that attempt to
accelerate existing memory models by reducing both buffer capacity and ordering-related stalls.
Timestamps have been used to implement memory models [30, 85] and for cache
coherence verification [73]. This thesis, however, is the first to identify and solve the
problems associated with using remote accesses that make data access efficient in
large scale multicores.
8.9 On-Chip Network and DRAM Performance
Memory bottlenecks in multicores could be alleviated using newer technologies such
as embedded DRAM [69] and 3D stacking [62] as well as by using intelligent memory scheduling techniques [8]. Network bottlenecks could be reduced by using newer
technologies such as photonics [56] as well as by using better topologies [82], router
bypassing [51], adaptive routing schemes [27] and intelligent flow control [27] techniques.
Both memory and network bottlenecks could also be alleviated using intelligent
cache hierarchy management. Better last-level cache (LLC) partitioning [76, 12] and
replacement schemes [75, 46] have been proposed to reduce memory pressure. Better
cache replication [39, 97, 10], placement [39, 12] and allocation [24] schemes have been proposed to exploit application data locality and reduce network traffic. Our proposed
proposed to exploit application data locality and reduce network traffic. Our proposed
extension for locality-aware cache coherence ensures that it can be implemented in
multicore processors with popular memory models alongside any of the above schemes.
Chapter 9
Conclusion
Applications exhibit varying reuse behavior towards cache lines and this behavior can
be exploited to improve performance and energy efficiency through selective replication of cache lines. Replicating high-reuse lines improves access latency and energy,
while suppressing the replication of low-reuse lines reduces data movement and improves cache utilization. No correlation between reuse and cache line type (e.g., private data, shared read-only and read-write data) has been observed
experimentally for data accesses. (Instructions, however, have good reuse.) Hence, a replication policy
based on data classification does not produce optimal results.
The varying reuse behavior towards cache lines has been observed at both the L1
and L2 cache levels. This enables a reuse-based replication scheme to be applied to
all levels of a cache hierarchy. Exploiting variable reuse behavior at the L1 cache also
enables the transfer of just the requested word to/from the remote L2 cache location.
This is more energy-efficient than transferring the entire contents of a low-reuse cache
line. Such a data access technique is called 'remote access'.
In processors with good single-thread performance, it is important to use speculation and prefetching to retain load/store performance when strong memory consistency models are required.
State-of-the-art processors rely on cache coherence
messages (invalidations/evictions) to detect speculation violations.
However, such
coherence messages are avoided with remote accesses, necessitating an alternate technique to detect violations. This thesis proposes a novel timestamp-based technique
to detect speculation violations. This technique can be applied to loads that access
data directly at the private L1-D cache as well, obviating the need for cache coherence
messages to detect violations. The timestamp mechanism is efficient due to the observation that consistency violations only occur due to conflicting accesses that have
temporal proximity (i.e., within a few cycles of each other), thus requiring timestamps
to be stored only for a small time window. The timestamp-based validation technique
was found to produce results close to that of an ideal scheme that does not suffer from
speculation violations.
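A schematic of the violation check is sketched below: it assumes a hypothetical retention window and per-address last-write timestamps, and is meant only to convey why timestamps outside a small window can be discarded; it is not the exact hardware algorithm evaluated in this thesis.

```cpp
#include <cstdint>

// Schematic of the timestamp check: a speculatively performed load records
// the cycle at which it read its value; at commit it is compared against the
// last-write timestamp of the same address. Because the conflicting accesses
// that matter are close in time, last-write timestamps only need to be
// retained for a small window of recent cycles. Constants are illustrative.
constexpr uint64_t kWindow = 256;   // hypothetical retention window (cycles)

struct LoadRecord {
    uint64_t addr;
    uint64_t readTimestamp;   // cycle at which the load obtained its value
};

// lastWriteTs: most recent write timestamp known for this address, or 0 if it
// has already fallen out of the retention window.
bool violatesConsistency(const LoadRecord &ld, uint64_t lastWriteTs, uint64_t now) {
    if (lastWriteTs == 0 || now - lastWriteTs > kWindow)
        return false;                       // any conflicting write is too old to matter
    return lastWriteTs > ld.readTimestamp;  // a later write invalidated the load's value
}
```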
Providing scalable cache coherence is of paramount importance to computer architects since it preserves the familiar programming paradigm of shared
memory in multicore processors with ever-increasing core counts. This thesis proposes
the ACKwise protocol that provides scalable cache coherence in coordination
with a network with broadcast support. Simple changes to the routing protocol of
a mesh network are proposed to support broadcasts. No virtual channels are added.
ACKwise also supports high degrees of read sharing without any overheads.
ACKwise works in synergy with the locality-aware cache replication schemes. The
locality-aware schemes prevent replication of low-reuse data, thereby reducing its
number of private sharers. This potentially reduces the number of occasions when
invalidation broadcasts are required. Employing broadcasts for high-reuse data does
not harm efficiency since such data has a long lifetime in the cache.
The principal thesis contributions are summarized next, and opportunities
for future work are discussed afterwards.
9.1 Thesis Contributions
This thesis makes the following five important contributions that holistically address
performance, energy efficiency and programmability.
1. Proposes a scalable limited directory-based coherence protocol, ACKwise [56,
57], that reduces the directory storage needed to track the sharers of a data
block.
2. Proposes a Locality-aware Private Cache Replication scheme [54] to better manage the private caches in multicore processors by intelligently controlling data
caching and replication.
3. Proposes a Timestamp-based Memory Ordering Validation technique that enables the preservation of familiar memory consistency models when the intelligent private cache replication scheme is applied to production processors.
4. Proposes a Locality-aware LLC Replication scheme [53] that better manages the
last-level shared cache (LLC) in multicore processors by balancing shared data
(and instruction) locality and off-chip miss rate through controlled replication.
5. Proposes a Locality-aware Cache Hierarchy Management Scheme that seamlessly combines all the above schemes to provide an optimal combination of
data locality and miss rate at all levels of the cache hierarchy.
On a 64-core multicore processor with out-of-order cores, Locality-aware Cache
Hierarchy Replication improves completion time by 15% and energy by 22% while
incurring a storage overhead of 30.5 KB per core (i.e., ~10% of the aggregate cache
capacity of each core).
9.2 Future Directions
There are three potential future directions for this work. The first combines hardware
and software techniques to perform intelligent replication.
The second reduces the
storage overhead of the locality-aware replication schemes by potentially exploiting
the correlation between the reuse exhibited by cache lines belonging to the same page
or instruction address. The third reduces the dependence on the timestamp-based
scheme by exploring an optimized variant of the non-conflicting scheme described
earlier.
This scheme works by classifying pages into private, shared read-only and
shared read-write on a time interval basis, allowing back transitions from read-write
to read-only and shared to private.
These potential future directions are discussed
below.
9.2.1 Hybrid Software/Hardware Techniques
The programmer could designate certain data structures or program code as having
potential benefits if replicated. The hardware would then be responsible for enforcing
these software hints. In this case, the intelligence to decide which cache lines should be
replicated is delegated to software. However, the hardware would still be required to
implement the various schemes in a deadlock-free manner so that starvation freedom
and forward progress can be ensured.
9.2.2 Classifier Compression
Compression techniques to reduce the storage overhead needed for the classifier need
to be explored. One method is to explore the correlation between the reuse of cache
lines that belong to the same page or instruction. If such a correlation exists, then
locality tracking structures need to be maintained only on a per-page or a per-instruction basis.
9.2.3 Optimized Variant of Non-Conflicting Scheme
One simple way to implement a memory consistency model is to enable loads/stores
to non-conflicting data (i.e., private, and shared read-only data) to be issued and
completed out-of-order.
If most data accesses are to private and shared read-only
data, such an implementation would be efficient. This requires an efficient classifier
that dynamically transitions pages from shared to private and read-write to read-only
when the opportunity arises. One way to achieve this is to classify pages on a time
interval basis and reset the assigned labels periodically.
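A software sketch of such an interval-based classifier is shown below; the page table, labels and transition rules are illustrative assumptions rather than a concrete hardware proposal.

```cpp
#include <cstdint>
#include <unordered_map>

// Sketch of interval-based page classification for the optimized
// non-conflicting scheme: each page is labeled Private, SharedReadOnly or
// SharedReadWrite based on the accesses seen in the current interval, and the
// labels are reset when a new interval starts so pages can transition back
// (e.g., read-write -> read-only, shared -> private).
enum class PageClass { Private, SharedReadOnly, SharedReadWrite };

struct PageState {
    PageClass cls = PageClass::Private;
    int owner = -1;          // first core to touch the page in this interval
};

struct PageClassifier {
    std::unordered_map<uint64_t, PageState> table;   // page number -> state

    void onAccess(uint64_t page, int core, bool isWrite) {
        PageState &st = table[page];
        if (st.owner == -1) st.owner = core;
        if (core != st.owner) {
            if (isWrite) st.cls = PageClass::SharedReadWrite;
            else if (st.cls == PageClass::Private) st.cls = PageClass::SharedReadOnly;
        } else if (isWrite && st.cls == PageClass::SharedReadOnly) {
            st.cls = PageClass::SharedReadWrite;
        }
    }

    // Called at the end of each time interval: discard the labels so that the
    // next interval can re-learn them (allowing back-transitions).
    void startNewInterval() { table.clear(); }
};
```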
Bibliography
[1] The SPARC Architecture Manual, Version 8. SPARC International, Inc. http://www.sparc.org/standards/V8.pdf, 1992.
[2] First the tick, now the tock: Next generation Intel microarchitecture (Nehalem).
White Paper, 2008.
[3] DARPA UHPC Program (DARPA-BAA-10-37), March 2010.
[4] Intel unveils 72-core x86 Knights Landing CPU for exascale supercomputing. http://www.extremetech.com/extreme/171678-intel-unveils-72-core-x86-knights-landing-cpu-for-exascale-supercomputing, November 2013.
[5] Santosh G. Abraham, Rabin A. Sugumar, Daniel Windheiser, B. R. Rau, and
Rajiv Gupta. Predictability of load/store instruction latencies. In Proceedings
of the 26th annual international symposium on Microarchitecture, MICRO 26,
pages 139-152, Los Alamitos, CA, USA, 1993. IEEE Computer Society Press.
[6] Anant Agarwal, Richard Simoni, John L. Hennessy, and Mark Horowitz. An
Evaluation of Directory Schemes for Cache Coherence. In International Symposium on Computer Architecture, 1988.
[7] Jade Alglave, Daniel Kroening, Vincent Nimal, and Daniel Poetzl. Don't sit
on the fence: A static analysis approach to automatic fence insertion. CoRR,
abs/1312.1411, 2013.
[8] Rachata Ausavarungnirun, Kevin Kai-Wei Chang, Lavanya Subramanian,
Gabriel H. Loh, and Onur Mutlu. Staged memory scheduling: Achieving high
performance and scalability in heterogeneous systems. In Proceedings of the
39th Annual International Symposium on Computer Architecture, ISCA '12,
pages 416-427, Washington, DC, USA, 2012. IEEE Computer Society.
[9] Luiz André Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas
Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben
Verghese. Piranha: a scalable architecture based on single-chip multiprocessing. In Proceedings of the 27th annual international symposium on Computer
architecture, ISCA '00, pages 282-293, New York, NY, USA, 2000. ACM.
[10] Bradford M. Beckmann, Michael R. Marty, and David A. Wood. ASR: Adaptive Selective Replication for CMP Caches. In Proceedings of the 39th Annual
IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pages
443-454, 2006.
[11] Bradford M. Beckmann and David A. Wood. Managing wire delay in large
chip-multiprocessor caches. In Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture, MICRO 37, pages 319-330, Washington, DC, USA, 2004. IEEE Computer Society.
[12] Nathan Beckmann and Daniel Sanchez. Jigsaw: Scalable Software-defined
Caches. In Proceedings of the 22Nd International Conference on Parallel Architectures and Compilation Techniques, PACT '13, pages 213-224, Piscataway,
NJ, USA, 2013. IEEE Press.
[13] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay,
M. Reif, Liewei Bao, J. Brown, M. Mattina, Chyi-Chang Miao, C. Ramey,
D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook. Tile64 - processor: A 64-core soc with mesh
interconnect. In International Solid-State Circuits Conference, 2008.
[14] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In International Conference on Parallel Architectures and Compilation Techniques,
2008.
[15] Colin Blundell, Milo M. K. Martin, and Thomas F. Wenisch. Invisifence:
Performance-transparent memory ordering in conventional multiprocessors. In
In Proc. 36th Intl. Symp. on Computer Architecture, 2009.
[16] S. Borkar. Panel on State-of-the-art Electronics. NSF Workshop on Emerging
Technologies for Interconnects, http://weti.cs.ohiou.edu/, 2012.
[17] Shekhar Borkar. Thousand core chips: a technology perspective. In Design
Automation Conference, 2007.
[18] L. M. Censier and P. Feautrier. A new solution to coherence problems in multicache systems. IEEE Trans. Comput., 27(12):1112-1118, December 1978.
[19] Luis Ceze, James Tuck, Pablo Montesinos, and Josep Torrellas. BulkSC: Bulk
Enforcement of Sequential Consistency. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, pages 278-289,
New York, NY, USA, 2007. ACM.
[20] David Chaiken, Craig Fields, Kiyoshi Kurihara, and Anant Agarwal. Directory-based cache coherence in large-scale multiprocessors. Computer, 23(6):49-58,
June 1990.
[21] David Chaiken, John Kubiatowicz, and Anant Agarwal. LimitLESS Directories:
A Scalable Cache Coherence Scheme. In Proceedings of the Fourth International
Conference on Architectural Support for Programming Languages and Operating
Systems, ASPLOS IV, pages 224-234, New York, NY, USA, 1991. ACM.
[22] Jichuan Chang and Gurindar S. Sohi. Cooperative Caching for Chip Multipro-
cessors. In ISCA, 2006.
[23] Zeshan Chishti, Michael D. Powell, and T. N. Vijaykumar. Distance associativity for high-performance energy-efficient non-uniform cache architectures. In
Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, MICRO 36, pages 55-, Washington, DC, USA, 2003. IEEE
Computer Society.
[24] Zeshan Chishti, Michael D. Powell, and T. N. Vijaykumar. Optimizing replication, communication, and capacity allocation in cmps. In Proceedings of
the 32nd annual international symposium on Computer Architecture, ISCA '05,
pages 357-368, 2005.
[25] Sangyeun Cho and Lei Jin. Managing distributed, shared L2 caches through
os-level page allocation. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pages 455-468, Washington, DC, USA, 2006. IEEE Computer Society.
[26] Pat Conway, Nathan Kalyanasundharam, Gregg Donley, Kevin Lepak, and Bill
Hughes. Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor. IEEE Micro, 30(2), March 2010.
[27] William Dally and Brian Towles. Principles and Practices of Interconnection
Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
[28] Abhishek Das, Matthew Schuchhardt, Nikos Hardavellas, Gokhan Memik, and
Alok N. Choudhary. Dynamic directories: A mechanism for reducing on-chip
interconnect power in multicores. In DATE, pages 479-484, 2012.
[29] Noel Eisley, Li-Shiuan Peh, and Li Shang. In-network cache coherence. In
IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pages
321-332, 2006.
[30] M. Elver and V. Nagarajan. TSO-CC: Consistency directed cache coherence for
TSO. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th
International Symposium on, pages 165-176, Feb 2014.
[31] C. Fensch and M. Cintra. An os-based alternative to full hardware coherence
on tiled cmps. In High Performance Computer Architecture, 2008. HPCA 2008.
IEEE 14th International Symposium on, pages 355-366, 2008.
[32] Michael Ferdman, Pejman Lotfi-Kamran, Ken Balet, and Babak Falsafi. Cuckoo
directory: A scalable directory for many-core systems. In 17th IEEE International Symposium on High Performance Computer Architecture (HPCA), pages
169-180, 2011.
[33] Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. Two techniques to
enhance the performance of memory consistency models. In Proceedings of
the 1991 International Conference on Parallel Processing, pages 355-364, 1991.
[34] Chris Gniady, Babak Falsafi, and T. N. Vijaykumar. Is SC + ILP = RC? In
Proceedings of the 26th Annual International Symposium on Computer Architecture, ISCA '99, pages 162-171, Washington, DC, USA, 1999. IEEE Computer
Society.
[35] Antonio González, Carlos Aliagas, and Mateo Valero. A data cache with multiple caching strategies tuned to different types of locality. In Proceedings of the
9th international conference on Supercomputing, ICS '95, pages 338-347, New
York, NY, USA, 1995. ACM.
[36] Peter Greenhalgh. big.LITTLE processing with ARM Cortex-A15 & Cortex-A7. http://www.arm.com/files/downloads/bigLITTLEFinalFinal.pdf, 2011.
[37] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing memory and
traffic requirements for scalable directory-based cache coherence schemes. In
International Conference on Parallel Processing, pages 312-321, 1990.
[38] P. Hammarlund, A.J. Martinez, A.A. Bajwa, D.L. Hill, E. Hallnor, Hong Jiang,
M. Dixon, M. Derr, M. Hunsaker, R. Kumar, R.B. Osborne, R. Rajwar, R. Singhal, R. D'Sa, R. Chappell, S. Kaushik, S. Chennupaty, S. Jourdan, S. Gunther,
T. Piazza, and T. Burton. Haswell: The fourth-generation intel core processor.
Micro, IEEE, 34(2):6-20, Mar 2014.
[39] Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki.
Reactive NUCA: Near-optimal Block Placement and Replication in Distributed
Caches. In Proceedings of the 36th Annual International Symposium on Com-
puter Architecture, ISCA '09, pages 184-195, New York, NY, USA, 2009. ACM.
[40] Enric Herrero, José González, and Ramon Canal. Elastic cooperative caching:
an autonomous dynamically adaptive memory hierarchy for chip multiprocessors. In Proceedings of the 37th annual international symposium on Computer
architecture, ISCA '10, pages 419-428, New York, NY, USA, 2010. ACM.
[41] Henry Hoffmann, David Wentzlaff, and Anant Agarwal. Remote store programming: A memory model for embedded multicore. In Proceedings of the
5th International Conference on High Performance Embedded Architectures and
Compilers, HiPEAC'10, pages 3-17, Berlin, Heidelberg, 2010. Springer-Verlag.
[42] Jaehyuk Huh, Changkyu Kim, Hazim Shafi, Lixin Zhang, Doug Burger, and
Stephen W. Keckler. A NUCA substrate for flexible CMP cache sharing. In
Proceedings of the 19th annual international conference on Supercomputing, ICS
'05, pages 31-40, New York, NY, USA, 2005. ACM.
[43] S.M.Z. Iqbal, Yuchen Liang, and H. Grahn. ParMiBench - An Open-Source
Benchmark for Embedded Multiprocessor Systems. Computer Architecture Let-
ters, 9(2):45 -48, feb. 2010.
[44] Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon C. Steely Jr., and Joel
Emer. Achieving non-inclusive cache performance with inclusive caches: Temporal locality aware (TLA) cache management policies. In Int'l Symposium on
Microarchitecture, 2010.
[45] Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot, Simon
Steely, Jr., and Joel Emer. Adaptive insertion policies for managing shared
caches. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT '08, pages 208-219, 2008.
[46] Aamer Jaleel, Kevin B. Theobald, Simon C. Steely, Jr., and Joel Emer.
High Performance Cache Replacement Using Re-reference Interval Prediction
(RRIP). In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 60-71, New York, NY, USA, 2010. ACM.
[47] Natalie Enright Jerger, Li-Shiuan Peh, and Mikko Lipasti. Virtual circuit tree
multicasting: A case for on-chip hardware multicast support. In Int'l Symposium on Computer Architecture, 2008.
[48] Norman P. Jouppi. Improving direct-mapped cache performance by the addition
of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th
annual international symposium on Computer Architecture, ISCA '90, pages
364-373, 1990.
[49] Khakifirooz, A. and Nayfeh, O.M. and Antoniadis, D. A Simple Semiempirical Short-Channel MOSFET Current-Voltage Model Continuous Across All
Regions of Operation and Employing Only Physical Parameters. Electron Devices, IEEE Transactions on, 56(8):1674 -1680, aug. 2009.
[50] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An adaptive, nonuniform cache structure for wire-delay dominated on-chip caches. In Proceedings
of the 10th international conference on Architectural support for programming
languages and operating systems, ASPLOS X, pages 211-222, New York, NY,
USA, 2002. ACM.
[51] Tushar Krishna, Chia-Hsin Owen Chen, Sunghyun Park, Woo Cheol Kwon,
Suvinay Subramanian, Anantha Chandrakasan, and Li-Shiuan Peh. Single-cycle
multihop asynchronous repeated traversal: A smart future for reconfigurable
on-chip networks. Computer, 46(10):48-55, October 2013.
[52] Amit Kumar, Partha Kundu, Arvind P. Singh, Li-Shiuan Peh, and Niraj K.
Jha. A 4.6tbits/s 3.6ghz single-cycle noc router with a novel switch allocator
in 65nm cmos. In International Conference on Computer Design, 2007.
[53] G. Kurian, S. Devadas, and O. Khan. Locality-aware data replication in the
last-level cache. In High Performance Computer Architecture (HPCA), 2014
IEEE 20th International Symposium on, pages 1-12, Feb 2014.
[54] George Kurian, Omer Khan, and Srinivas Devadas. The locality-aware adaptive cache coherence protocol. In Proceedings of the 40th Annual International
Symposium on Computer Architecture, ISCA '13, pages 523-534, New York,
NY, USA, 2013. ACM.
[55] George Kurian, Omer Khan, and Srinivas Devadas. The locality-aware adaptive cache coherence protocol. In Proceedings of the 40th Annual International
Symposium on Computer Architecture, ISCA '13, pages 523-534, New York,
NY, USA, 2013. ACM.
[56] George Kurian, Jason E. Miller, James Psota, Jonathan Eastep, Jifeng Liu,
Jurgen Michel, Lionel C. Kimerling, and Anant Agarwal. ATAC: A 1000-core
Cache-coherent Processor with On-chip Optical Network. In Proceedings of
the 19th International Conference on Parallel Architectures and Compilation
Techniques, PACT '10, pages 477-488, New York, NY, USA, 2010. ACM.
[57] George Kurian, Chen Sun, Chia-Hsin Owen Chen, Jason E. Miller, Jurgen
Michel, Lan Wei, Dimitri A. Antoniadis, Li-Shiuan Peh, Lionel Kimerling,
Vladimir Stojanovic, and Anant Agarwal. Cross-layer energy and performance
evaluation of a nanophotonic manycore processor system using real application
workloads. Parallel and Distributed Processing Symposium, 0:1117-1130, 2012.
[58] An-Chow Lai, Cem Fide, and Babak Falsafi. Dead-block prediction & dead-block correlating prefetchers. In Proceedings of the 28th annual international
symposium on Computer architecture, ISCA '01, pages 144-154, New York, NY,
USA, 2001. ACM.
[59] L. Lamport. How to make a multiprocessor computer that correctly executes
multiprocess programs. IEEE Trans. Comput., 28(9):690-691, September 1979.
[60] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen,
and Norman P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In International
Symposium on Microarchitecture, 2009.
[61] Haiming Liu, Michael Ferdman, Jaehyuk Huh, and Doug Burger. Cache bursts:
A new approach for eliminating dead blocks and increasing cache efficiency. In
Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture, MICRO 41, pages 222-233, 2008.
[62] G.H. Loh. 3d-stacked memory architectures for multi-core processors. In Computer Architecture, 2008. ISCA '08. 35th International Symposium on, pages
453-464, June 2008.
[63] Meredydd Luff and Simon Moore. Asynchronous remote stores for interprocessor communication. In Future Architectural Support for Parallel Pro-
gramming (FASPP), 2012.
[64] Yeong-Chang Maa, Dhiraj K. Pradhan, and Dominique Thiebaut. Two economical directory schemes for large-scale cache coherent multiprocessors. SIGARCH
Comput. Archit. News, 19(5):10-, September 1991.
[65] Sela Mador-Haim, Luc Maranget, Susmit Sarkar, Kayvan Memarian, Jade Alglave, Scott Owens, Rajeev Alur, Milo M. K. Martin, Peter Sewell, and Derek
Williams. An axiomatic memory model for power multiprocessors. In Proceedings of the 24th International Conference on Computer Aided Verification,
CAV'12, pages 495-512, Berlin, Heidelberg, 2012. Springer-Verlag.
[66] Milo M. K. Martin, Mark D. Hill, and Daniel J. Sorin. Why on-chip cache
coherence is here to stay. Commun. ACM, 55(7):78-89, 2012.
[67] Scott McFarling. Cache replacement with dynamic exclusion. In Proceedings of
the 19th annual international symposium on Computer architecture, ISCA '92,
pages 191-200, New York, NY, USA, 1992. ACM.
[68] Jason E. Miller, Harshad Kasture, George Kurian, Charles Gruenwald III,
Nathan Beckmann, Christopher Celio, Jonathan Eastep, and Anant Agarwal.
Graphite: A Distributed Parallel Simulator for Multicores. In International
Symposium on High-Performance Computer Architecture, 2010.
[69] Sparsh Mittal, Jeffrey S. Vetter, and Dong Li. Improving energy efficiency of
embedded dram caches for high-end computing systems. In Proceedings of the
23rd International Symposium on High-performance Parallel and Distributed
Computing, HPDC '14, pages 99-110, New York, NY, USA, 2014. ACM.
[70] Brian W. O'Krafka and A. Richard Newton. An empirical evaluation of two
memory-efficient directory methods. In Proceedings of the 17th annual international symposium on Computer Architecture, ISCA '90, pages 138-147, New
York, NY, USA, 1990. ACM.
[71] Jongsoo Park, Richard M. Yoo, Daya S. Khudia, Christopher J. Hughes, and
Daehyun Kim. Location-aware cache management for many-core processors
with deep cache hierarchy. In Proceedings of SC13: International Conference
for High Performance Computing, Networking, Storage and Analysis, SC '13,
pages 20:1-20:12, New York, NY, USA, 2013. ACM.
[72] Dadi Perlmutter. Introducing next generation low power microarchitecture: Silvermont. http://files.shareholder.com/downloads/INTC/0x0x660894/f3398730-60e7-44bc-a92f-9a9a652f11c9/2013_SilvermontFINAL-Mon_Harbor.pdf, 2013.
[73] Manoj Plakal, Daniel J. Sorin, Anne E. Condon, and Mark D. Hill. Lamport
clocks: Verifying a directory cache-coherence protocol. In Proceedings of
the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures,
SPAA '98, pages 67-76, New York, NY, USA, 1998. ACM.
[74] M.K. Qureshi. Adaptive spill-receive for robust high-performance caching in
cmps. In High Performance Computer Architecture, 2009. HPCA 2009. IEEE
15th International Symposium on, pages 45-54, Feb. 2009.
[75] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely, and Joel
Emer. Adaptive Insertion Policies for High Performance Caching. In Proceedings
of the 34th Annual International Symposium on Computer Architecture, ISCA
'07, pages 381-391, New York, NY, USA, 2007. ACM.
[76] Moinuddin K. Qureshi and Yale N. Patt. Utility-based cache partitioning: A
low-overhead, high-performance, runtime mechanism to partition shared caches.
In Proceedings of the 39th Annual IEEE/ACM International Symposium on
Microarchitecture, MICRO 39, pages 423-432, Washington, DC, USA, 2006.
IEEE Computer Society.
[77] Dyer Rolan, Basilio B. Fraguela, and Ramon Doallo. Adaptive set-granular
cooperative caching. In Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture, HPCA '12, pages 1-12,
Washington, DC, USA, 2012. IEEE Computer Society.
[78] Daniel Sanchez and Christos Kozyrakis. The ZCache: Decoupling Ways and
Associativity. In Proceedings of the 2010 43rd Annual IEEE/ACM International
Symposium on Microarchitecture, MICRO '43, pages 187-198, Washington, DC,
USA, 2010. IEEE Computer Society.
[79] Daniel Sanchez and Christos Kozyrakis. Vantage: scalable and efficient finegrain cache partitioning. In ISCA, pages 57-68, 2011.
[80] Daniel Sanchez and Christos Kozyrakis. SCD: A Scalable Coherence Directory with Flexible Sharer Set Encoding. In International Symp. on High-Performance Computer Architecture, 2012.
[81] Daniel Sanchez and Christos Kozyrakis. Zsim: Fast and accurate microarchitectural simulation of thousand-core systems. In Proceedings of the 40th Annual
International Symposium on Computer Architecture, ISCA '13, pages 475-486,
New York, NY, USA, 2013. ACM.
[82] Daniel Sanchez, George Michelogiannakis, and Christos Kozyrakis. An Analysis of On-chip Interconnection Networks for Large-scale Chip Multiprocessors.
ACM Trans. Archit. Code Optim., 7(1):4:1-4:28, May 2010.
[83] Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. x86-TSO: A rigorous and usable programmer's model for x86
multiprocessors. Commun. ACM, 53(7):89-97, July 2010.
[84] Abhayendra Singh, Satish Narayanasamy, Daniel Marino, Todd Millstein, and
Madanlal Musuvathi. End-to-end sequential consistency. In Proceedings of the
39th Annual International Symposium on Computer Architecture, ISCA '12,
pages 524-535, Washington, DC, USA, 2012. IEEE Computer Society.
[85] Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O'Connor,
and Tor M. Aamodt. Cache coherence for gpu architectures. In Proceedings of
the 2013 IEEE 19th International Symposium on High Performance Computer
Architecture (HPCA), HPCA '13, pages 578-590, Washington, DC, USA, 2013.
IEEE Computer Society.
[86] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures in Computer Architecture,
Morgan Claypool Publishers, 2011.
[87] Chen Sun, Chia-Hsin Owen Chen, George Kurian, Lan Wei, Jason Miller, Anant
Agarwal, Li-Shiuan Peh, and Vladimir Stojanovic. DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip
Modeling. In International Symposium on Networks-on-Chip, 2012.
[88] Sun Microsystems. UltraSPARC T2 supplement to the UltraSPARC architecture 2007. Technical Report, 2007.
[89] Gary Tyson, Matthew Farrens, John Matthews, and Andrew R. Pleszkun. A
modified approach to data cache management. In International Symposium on
Microarchitecture, pages 93-103, 1995.
[90] Lan Wei, F. Boeuf, T. Skotnicki, and H.-S.P. Wong. Parasitic Capacitances:
Analytical Models and Impact on Circuit-Level Performance. Electron Devices,
IEEE Transactions on, 58(5):1361-1370, May 2011.
[91] Thomas F. Wenisch, Anastasia Ailamaki, Babak Falsafi, and Andreas
Moshovos. Mechanisms for store-wait-free multiprocessors. In Proceedings of
the 34th Annual International Symposium on Computer Architecture, ISCA '07,
pages 266-277, New York, NY, USA, 2007. ACM.
[92] David Wentzlaff, Patrick Griffin, Henry Hoffmann, Liewei Bao, Bruce Edwards,
Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John F. Brown III, and
Anant Agarwal. On-chip interconnection architecture of the tile processor.
IEEE Micro, 27(5):15-31, 2007.
[93] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and
Anoop Gupta. The SPLASH-2 Programs: Characterization and Methodological
Considerations. In International Symposium on Computer Architecture, 1995.
[94] K.C. Yeager. The MIPS R10000 superscalar microprocessor. Micro, IEEE, 16(2):28-41, Apr 1996.
[95] T. Yemliha, S. Srikantaiah, M. Kandemir, M. Karakoy, and M.J. Irwin. Integrated code and data placement in two-dimensional mesh based chip multiprocessors. In Computer-Aided Design, 2008. ICCAD 2008. IEEE/ACM International Conference on, pages 583-588, Nov. 2008.
[96] Jason Zebchuk, Vijayalakshmi Srinivasan, Moinuddin K. Qureshi, and Andreas
Moshovos. A tagless coherence directory. In Proceedings of the 42Nd Annual
IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages
423-434, New York, NY, USA, 2009. ACM.
[97] Michael Zhang and Krste Asanovic. Victim replication: Maximizing capacity
while hiding wire delay in tiled chip multiprocessors. In Proceedings of the 32Nd
Annual International Symposium on Computer Architecture, ISCA '05, pages
336-345, Washington, DC, USA, 2005. IEEE Computer Society.
[98] Yuanrui Zhang, Wei Ding, Mahmut Kandemir, Jun Liu, and Ohyoung Jang.
A Data Layout Optimization Framework for NUCA-based Multicores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44 '11, pages 489-500, New York, NY, USA, 2011. ACM.
[99] Hongzhou Zhao, Arrvindh Shriraman, and Sandhya Dwarkadas. SPACE: sharing pattern-based directory coherence for multicore scalability. In International
Conference on Parallel Architectures and Compilation Techniques, pages 135-146, 2010.
[100] Hongzhou Zhao, Arrvindh Shriraman, Sandhya Dwarkadas, and Vijayalakshmi
Srinivasan. SPATL: Honey, I Shrunk the Coherence Directory. In International
Conference on Parallel Architectures and Compilation Techniques, 2011.