Locality-aware Cache Hierarchy Management for Multicore Processors

by George Kurian

B.Tech., Indian Institute of Technology, Chennai (2008)
S.M., Massachusetts Institute of Technology (2010)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology, February 2015.

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, December 30, 2014 (signature redacted)
Certified by: Srinivas Devadas, Professor, Thesis Supervisor (signature redacted)
Accepted by: Leslie A. Kolodziejski, Chair, Department Committee on Graduate Students (signature redacted)

Locality-aware Cache Hierarchy Management for Multicore Processors
by George Kurian

Submitted to the Department of Electrical Engineering and Computer Science on December 30, 2014, in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Abstract

Next generation multicore processors and applications will operate on massive data with significant sharing. A major challenge in their implementation is the storage requirement for tracking the sharers of data. The bit overhead for such storage scales quadratically with the number of cores in conventional directory-based cache coherence protocols. Another major challenge is limited cache capacity and the data movement incurred by conventional cache hierarchy organizations when dealing with massive data scales. These two factors impact memory access latency and energy consumption adversely. This thesis proposes scalable, efficient mechanisms that improve effective cache capacity (i.e., by improving utilization) and reduce data movement by exploiting locality and controlling replication.

First, a limited directory-based protocol, ACKwise, is proposed to track the sharers of data in a cost-effective manner. ACKwise leverages broadcasts to implement scalable cache coherence. Broadcast support can be implemented in a 2-D mesh network by making simple changes to its routing policy without requiring any additional virtual channels.

Second, a locality-aware replication scheme that better manages the private caches is proposed. This scheme controls replication based on data reuse information and seamlessly adapts between private and logically shared caching of on-chip data at the fine granularity of cache lines. A low-overhead runtime profiling capability to measure the locality of each cache line is built into hardware. Private caching is only allowed for data blocks with high spatio-temporal locality.

Third, a timestamp-based memory ordering validation scheme is proposed that enables the locality-aware private cache replication scheme to be implementable in processors with out-of-order memory that employ popular memory consistency models. This method does not rely on cache coherence messages to detect speculation violations, and hence is applicable to the locality-aware protocol. The timestamp mechanism is efficient due to the observation that consistency violations only occur due to conflicting accesses that have temporal proximity (i.e., within a few cycles of each other), thus requiring timestamps to be stored only for a small time window.
Fourth, a locality-aware last-level cache (LLC) replication scheme that better manages the LLC is proposed. This scheme adapts replication at runtime based on fine-grained cache line reuse information and thereby, balances data locality and off-chip miss rate for optimized execution. Finally, all the above schemes are combined to obtain a cache hierarchy replication scheme that provides optimal data locality and miss rates at all levels of the cache hierarchy. The design of this scheme is motivated by the experimental observation that both locality-aware private cache & LLC replication enable varying performance improvements across benchmarks. These techniques enable optimal use of the on-chip cache capacity, and provide low-latency, low-energy memory access, while retaining the convenience of shared memory and preserving the same memory consistency model. On a 64-core multicore processor with out-of-order cores, Locality-aware Cache Hierarchy Replication improves completion time by 15% and energy by 22% over a state-of-the-art baseline while incurring a storage overhead of 30.7 KB per core. (i.e., 10% the aggregate cache capacity of each core). Thesis Supervisor: Srinivas Devadas Title: Professor 4 Acknowledgments I would like to thank my advisors at MIT, Prof. Srinivas Devadas and Prof. Anant Agarwal for their enthusiastic support towards my research, both through technical guidance and funding. I would like to thank Prof. Devadas for his sharp intellect and witty demeanor which made group meetings fun and lively. I would like to thank him for sitting patiently through my many presentations and providing useful feedback that always made my talk better. I would also like to thank him for accepting to be my mentor midway through my PhD career when Prof. Agarwal had to leave to join EdX, and also allowing me to explore many research topics before finally deciding one worthy of serious consideration. I would like to thank Prof. Agarwal for his constant encouragement and lively personality which puts anyone who meets with him immediately at ease. I would like to thank him for all the presentation skills I have learnt from him which have enabled me to plan the best use of my time during a talk. I would also like to thank him for enabling me to get adjusted to MIT and USA during my initial days at graduate school. I would like to thank Prof. Omer Khan (University of Connecticut, Storrs) for constantly providing fuel for my research and helping me remain competitive. I would like to thank him for introducing me to many researchers at conferences. I would also like to thank him for being my partner in crime during numerous all-nighters before conference deadlines. I would like to thank Prof. Daniel Sanchez for being a member on my thesis committee and for all the feedback on the initial draft of my thesis. I would also like to thank him for sharing the ZSim simulator code from which I have extracted the processor decoder models used in my work. I would like to thank Jason Miller for all the discussions I had with him and for his willingness to explain technical details clearly, precisely and patiently. I would also like to thank him for the valuable feedback he gave for many presentations and for the many discussion sessions on the Graphite simulator. I would also like to thank 5 Nathan Beckmann, Harshad Kasture and Charles Gruenwald for the camaraderie and all the brainstorming sessions during the Graphite project. 
I would like to thank the Hornet group for all the informative presentations and discussions during the group meetings. I would like to thank Cree Bruins for all the support and for always being willing to help when needed. I would like to thank my close set of friends at Tang Hall for the friendship and support extended to me at MIT and for providing a venue for relaxation and enjoyment outside of research and courses. Last but not least, I would like to thank my parents and my brother for the love and support extended to me throughout my life and for making me the person I am today. I would like to thank them for supporting my decision to go to graduate school and for tolerating my absence from home for most of the year.

Contents

1 Introduction
  1.1 Performance and Energy Efficiency
    1.1.1 Cache Management
    1.1.2 Single-core Performance
  1.2 Programmability and Memory Models
  1.3 Thesis Contributions
    1.3.1 ACKwise Directory Coherence Protocol
    1.3.2 Locality-aware Private Cache Replication [54]
    1.3.3 Timestamp-based Memory Ordering Validation
    1.3.4 Locality-aware LLC Replication [53]
    1.3.5 Locality-aware Cache Hierarchy Replication
    1.3.6 Results Summary
  1.4 Organization of Thesis

2 Evaluation Methodology
  2.1 Baseline Architecture
  2.2 Performance Models
  2.3 Energy Models
  2.4 ToolFlow
  2.5 Application Benchmarks

3 ACKwise Directory Coherence Protocol
  3.1 Protocol Operation
  3.2 Silent Evictions
  3.3 Electrical Mesh Support for Broadcast
  3.4 Evaluation Methodology
  3.5 Results
    3.5.1 Sensitivity to Number of Hardware Pointers (k)
    3.5.2 Sensitivity to Broadcast Support
    3.5.3 Comparison to DirkNB [6]
  3.6 Summary

4 Locality-aware Private Cache Replication
  4.1 Motivation
  4.2 Protocol Operation
    4.2.1 Read Requests
    4.2.2 Write Requests
    4.2.3 Evictions and Invalidations
  4.3 Predicting Remote→Private Transitions
  4.4 Limited Locality Classifier
  4.5 Selection of PCT
  4.6 Overheads of the Locality-Based Protocol
    4.6.1 Storage
    4.6.2 Cache & Directory Accesses
    4.6.3 Network Traffic
    4.6.4 Transitioning between Private/Remote Modes
  4.7 Simpler One-Way Transition Protocol
  4.8 Synergy with ACKwise
  4.9 Potential Advantages of Locality-Aware Cache Coherence
  4.10 Evaluation Methodology
    4.10.1 Evaluation Metrics
  4.11 Results
    4.11.1 Energy and Completion Time Trends
    4.11.2 Static Selection of PCT
    4.11.3 Tuning Remote Access Thresholds
    4.11.4 Limited Locality Tracking
    4.11.5 Simpler One-Way Transition Protocol
    4.11.6 Synergy with ACKwise
  4.12 Summary

5 Timestamp-based Memory Ordering Validation
  5.1 Principal Contributions
  5.2 Background: Out-of-Order Processors
    5.2.1 Load-Store Unit
    5.2.2 Lifetime of Load/Store Operations
    5.2.3 Precise State
  5.3 Background: Memory Models Specification
    5.3.1 Operational Specification
    5.3.2 Axiomatic Specification
  5.4 Timestamp-based Consistency Validation
    5.4.1 Simple Implementation of TSO Ordering
    5.4.2 Basic Timestamp Algorithm
    5.4.3 Finite History Queues
    5.4.4 In-Flight Transaction Timestamps
    5.4.5 Mixing Remote Accesses and Private Caching
    5.4.6 Parallel Remote Stores
    5.4.7 Overheads
    5.4.8 Forward Progress & Starvation Freedom Guarantees
  5.5 Discussion
    5.5.1 Other Memory Models
    5.5.2 Multiple Clock Domains
  5.6 Parallelizing Non-Conflicting Accesses
    5.6.1 Classification
    5.6.2 TLB Miss Handler
    5.6.3 Page Protection Fault Handler
    5.6.4 Discussion
    5.6.5 Combining with Timestamp-based Speculation
  5.7 Evaluation Methodology
    5.7.1 Performance Models
    5.7.2 Energy Models
  5.8 Results
    5.8.1 Comparison of Schemes
    5.8.2 Sensitivity to PCT
    5.8.3 Sensitivity to History Retention Period (HRP)
  5.9 Summary

6 Locality-aware LLC Replication Scheme
  6.1 Motivation
    6.1.1 Cache Line Reuse
    6.1.2 Cluster-level Replication
    6.1.3 Proposed Idea
  6.2 Locality-Aware LLC Data Replication
    6.2.1 Protocol Operation
    6.2.2 Limited Locality Classifier Optimization
    6.2.3 Overheads
  6.3 Discussion
    6.3.1 Replica Creation Strategy
    6.3.2 Coherence Complexity
    6.3.3 Classifier Organization
  6.4 Cluster-Level Replication
  6.5 Evaluation Methodology
    6.5.1 Baseline LLC Management Schemes
    6.5.2 Evaluation Metrics
  6.6 Results
    6.6.1 Comparison of Replication Schemes
    6.6.2 LLC Replacement Policy
    6.6.3 Limited Locality Classifier
    6.6.4 Cluster Size Sensitivity Analysis
  6.7 Summary

7 Locality-Aware Cache Hierarchy Replication
  7.1 Motivation
  7.2 Implementation
    7.2.1 Microarchitecture Modifications
    7.2.2 Protocol Operation
    7.2.3 Optimizations
    7.2.4 Overheads
  7.3 Evaluation Methodology
  7.4 Results
    7.4.1 PCT and RT Threshold Sweep
  7.5 Summary

8 Related Work
  8.1 Data Replication
  8.2 Coherence Directory Organization
  8.3 Selective Caching / Dead-Block Eviction
  8.4 Remote Access
  8.5 Data Placement and Migration
  8.6 Cache Replacement Policy
  8.7 Cache Partitioning / Cooperative Cache Management
  8.8 Memory Consistency Models
  8.9 On-Chip Network and DRAM Performance

9 Conclusion
  9.1 Thesis Contributions
  9.2 Future Directions
    9.2.1 Hybrid Software/Hardware Techniques
    9.2.2 Classifier Compression
. . . . . . . 184 9.2.3 Optimized Variant of Non-Conflicting Scheme . . . . . . . . . 184 12 List of Figures 2-1 Architecture of the baseline system. Each core consists of a com- pute pipeline, private Li instruction and data caches, a physically distributed shared L2 cache with integrated directory and a network rou ter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3-1 Structure of an ACKwise, coherence directory entry . . . . . . . . . . 37 3-2 Broadcast routing on a mesh network from core B. X-Y dimensionorder routing is followed. . . . . . . . . . . . . . . . . . . . . . . . . . 3-3 Illustration of deadlock when using broadcasts on a 1-dimensional mesh network with X-Y routing. . . . . . . . . . . . . . . . . . . . . . . . . 3-4 39 40 Completion time of the ACKwisek protocol when k is varied as 1, 2, 4, 8, 16 & 256. Results are normalized to the full-map protocol (k = 256). 42 Energy of the ACKwisek protocol when k is varied as 1, 2, 4, 8, 16 & 3-5 256. Results are normalized to the full-map protocol (k = 256). 3-6 . . 43 Completion Time of the ACKwisek protocol when run on a mesh network without broadcast support. NB signifies the absence of broadcast support. k is varied as 1, 2, 4, 8 & 16. Results are normalized to the performance of the ACKwise 4 protocol on a mesh with broadcast support. 44 Completion time of the DirkNB protocol when k is varied as 2, 4, 8 & 3-7 16. Results are normalized to the ACKwise 4 protocol. . . . . . . . . . 45 . . . . . . . . . . . . . . . . . . . . . . . 47 . . . . . . . . . . . . . . . . . . . . . . . . . 48 4-1 Invalidations vs Utilization. 4-2 Evictions vs Utilization. 13 4-3 (j), ( and @ are mockup requests showing the two modes of accessing on- chip caches using our locality-aware protocol. Since the black data block has high locality with respect to the core at (), the directory at the home- node hands out a private copy of the cache line. On the other hand, the low-locality red data block is always cached in a single location at its homenode, and all requests (g, accesses. 4-4 @) are serviced using roundtrip remote-word . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Each cache line is initialized to Private with respect to all sharers. Based on the utilization counters that. are updated on each memory access to this cache line and the parameter PCT, the sharers are transitioned between Private and Remote modes. Here utilization = (private + remote) utilization. 50 4-5 Each LI cache tag is extended to include additional bits for tracking (a) private utilization, and (b) last-access time of the 4-6 cache line. . . . . . . . 50 A CKwise, - Complete classifier directory entry. The directory entry contains the state, tag, ACKwisep pointers as well as (a) mode (P/R.), (b) remote utilization counters and (c) last-access timestamps for tracking the locality of all the cores in the system . 4-7 . . . . . . . . . . . . . . . . . . . . . . . 51 The limited locality classifier extends the directory entry with mode, utilization, and RAT-level bits for a limited number of cores. A majority vote of the modes of tracked cores is used to classify new cores as private or remote sharers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8 Variation of Energy with PCT. Results are normalized to a PCT of 1. Note that Average and not Geometric-Mean is plotted here. 4-9 55 . . . . . 64 Variation of Completion Time with PCT. Results are normalized to a PCT of 1. 
Note that Average and not Geometric-Mean is plotted here. 64 4-10 Li Data Cache Miss Rate and Miss Type Breakdown vs PCT. Note that in this graph, the miss rate increases from left to right as well as from top to bottom . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 65 4-11 Variation of Geometric-Means of Completion Time and Energy with Private Caching Threshold (PCT). Results are normalized to a PCT ofI. ......... 68 .................................... 4-12 Remote Access Threshold sensitivity study for n.RA Teeis (L) and RA Tma, 68 (T ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-13 Variation of Completion Time and Energy with the number of hardware locality counters (k) in the Limitedk classifier. Limited 64 is identical to the Complete classifier. Benchmarks for which results are not shown are identical to WATER-SP, i.e., the Completion Time and Enerqy stay constant as k varies. 70 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-14 Cache miss rate breakdown variation with the number of hardware locality counters (k) in the Limitedk classifier. Limited 64 is identical to the Complete classifier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4-15 Ratio of Completion Time and Energy of Adaptl -ay over Adapt2 _way 71 4-16 Synergy between the locality-aware coherence protocol and ACKwise. Variation of average and maximum sharer count during invalidations as a function of PCT . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5-1 Operational Specification of the TSO Memory Consistency Model. . . 82 5-2 Microarchitecture of a multicore tile. The orange colored modules are added to support the proposed modifications. . . . . . . . . . . . . . 85 5-3 Structure of a load queue entry. . . . . . . . . . . . . . . . . . . . . . 87 5-4 Structure of a store queue entry. . . . . . . . . . . . . . . . . . . . . . 87 5-5 Structure of a page table entry. . . . . . . . . . . . . . . . . . . . . . 106 5-6 Completion Time breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Aerage and not GeometricM ean is plotted here. . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-7 110 Energy breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .111 15 5-8 Completion Time and Energy consumption as PCT varies from 1 to 16. Results are normalized to a PCT of 1 (i.e., Reactive-NUCA protocol). 5-9 Completion Time sensitivity to History Retention Period (HRP) as HRP varies from 64 to 4096. . . . . . . . . . . . . . . . . . . . . . . . 6-1 115 116 Distribution of instructions, private data, shared read-only data, and shared read-write data accesses to the LLC as a function of run-length. The classification is done at the cache line granularity. . . . . . . . . . . . . . . . . 6-2 120 Distribution of the accesses to the shared L2 cache as a function of the sharing degree. The accesses are broken down into (1) Instruction, (2) Shared Read-Only (RO) Data, (3) Shared Read-Write (RW) Data Read and (4) Shared Read-Write (RW) Data Write. 6-3 . . . . . . . . . . . . . . . . . . . 122 Distribution of the accesses to the shared L2 cache as a function of the sharing degree. The accesses are broken down into (1) Instruction, (2) Shared Read-Only (RO) Data, (3) Shared Read-Write (RW) Data Read and (4) Shared Read-1Write (RWV) Data Write. 6-4 - . . . . . . 
. . . . . . . . . . . . . 123 are mockup requests showing the locality-aware LLC replication protocol. The black data block has high reuse and a local LLC replica is allowed that services requests from ) and g. The low-reuse red data block is not allowed to be replicated at the LLC, and the request from @ that misses in the LI, must access the LLC slice at its home core. The home core for each data block can also service local private cache misses (e.g., T). . . 6-5 125 Each directory entry is extended with replication mode bits to classify the usefulness of LLC replication. Each cache line is initialized to non-replica mode with respect to all cores. Based on the reuse counters (at the home as well as the replica location) and the parameter RT, the cores are transitioned between replica and non-replica modes. Here XReuse is (Replica + Home) Reuse on an invalidation and Replica Reuse on an eviction. 16 . . . . . . . . 126 6-6 ACKwiseP-Complete locality classifier LLC tag entry. It contains the tag, LRU bits and directory entry. The directory entry contains the state, ACKwisep pointers, a Replica reuse counter as well as Replication mode bits and Home reuse counters for every core in the system. 6-7 . . . . . . . . . . . . . . . . 127 ACKwise -Limitedk locality classifier LLC tag entry. It contains the tag, LRU bits and directory entry. The directory entry contains the state, ACKwisep pointers, a Replica reuse counter as well as the Limitedk classifier. The Limitedk classifier contains a Replication mode bit and Home reuse counter for a limited number of cores. A majority vote of the modes of tracked cores is used to classify new cores as replicas or non-replicas. 6-8 . . . . . . . . . . 130 Energy breakdown for the LLC replication schemes evaluated. Results are normalized to that of S-NUCA. Note that Average and not Geometric-Mean is plotted here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9 140 Completion Time breakdown for the LLC replication schemes evaluated. Results are normalized to that of S-NUCA. Note that Average and not Geometric-Mean is plotted here. . . . . . . . . . . . . . . . . . . . . . . 141 6-10 Li Cache Miss Type breakdown for the LLC replication schemes evaluated. 142 6-11 Energy and Completion Time for the Limitedk classifier as a function of number of tracked sharers (k). The results are normalized to that of the Complete (= Limited 64 ) classifier. . . . . . . . . . . . . . . . . . . . . . 145 6-12 Energy and Completion Time at cluster sizes of 1, 4. 16 and 64 with the locality-aware data replication protocol. A cluster size of 64 is the same as R-NUCA except that it does not even replicate instructions. . . . . . . . . 17 147 7-1 The red block prefers to be in the I" mode, being replicated only at the L1-I/L1-D cache. Cores N and P access data directly at the Li cache. The & blue block prefers to be in the 2 nd mode, being replicated at both the LI L2 caches. Cores ( and P access data directly at the L. cache. The violet block prefers to be in the 3,d mode, and is accessed remotely at the L2 home location without being replicated in either the Li or the L2 cache. Cores P and @ access data using remote-word requests at the L2 home location. And finally, the green block prefers the 41h mode, being replicated at the L2 cache and accessed using word accesses at the L2 replica location. Cores and U access data, using remote-word requests. 7-2 Q . . . . . . . . . . . . . . 150 Modifications to the L2 cache line tag. 
Each cache line is augmented with a Private Reuse counter that tracks the number of times a cache line has been accessed at the L2 replica location. In addition, each cache line tag has classifiers for deciding whether or not to replicate lines in the Li cache & L2 cache. Both the Li & L2 classifiers contain the Mode, Home Reuse, and RAT-Level fields that serve to remember the past locality information for each cache line. The above information is only maintained for a limited number of cores, k, and the mode of untracked cores is obtained by a majority vote. 7-3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 Completion Time breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not GeometricM ean is plotted here. . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4 159 Energy breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5 Variation of Completion Time as a function of PCT& RT. The GeometricMean of the completion time obtained from all benchmarks is plotted. 7-6 160 162 Variation of Energy as a function of PCT & RT. The Geometric-Mean of the energy obtained from all benchmarks is plotted. 18 . . . . . . . . 162 List of Tables 2.1 Architectural parameters used for evaluation . . . . . . . . . . . . . . 33 2.2 Projected Transistor Parameters for 11 nm Tri-Gate . . . . . . . . . . 34 2.3 Problem sizes for our parallel benchmarks. . . . . . . . . . . . . . . . 35 4.1 Cache sizes per core. . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2 Storage required for the caches and coherence protocol per core. . . . 58 4.3 Locality-aware cache coherence protocol parameters . . . . . . . . . . 61 5.1 Timestamp-based speculation violation detection & locality-aware cache private cache replication scheme parameters. . . . . . . . . . . . . . . 108 . . . . . . . . . . . 137 6.1 Locality-aware LLC (7L2) replication parameters 7.1 Locality-aware protocol & timestamp-based speculation violation detection param eters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 157 20 Chapter 1 Introduction Increasing the number of cores has replaced clock frequency scaling as the method to improve performance in state-of-the-art multicore processors. These multiple cores can either be used in parallel by multiple applications or by multiple threads of the same application to complete work faster. Maintaining good multicore scalability while ensuring good single-core performance is of the utmost importance in continuing to improve performance and energy efficiency. 1.1 Performance and Energy Efficiency In the era of multicores, programmers need to invest more effort in designing software capable of exploiting multicore parallelism. To reduce memory access latency and power consumption, a programmer can manually orchestrate communication and computation or adopt the familiar programming paradigm of shared memory. But will current shared memory architectures scale to many cores? This thesis addresses the question of how to enable low-latency, low-energy memory access while retaining the convenience of shared memory. Current semiconductor trends project the advent of single-chip multicores dealing with data at unprecedented scale and complexity. Memory scalability is critically constrained by off-chip bandwidth and on-chip latency and energy consumption [171. 
For multicores dealing with massive data, memory access latency and energy consumption 21 are now first-order design constraints. 1.1.1 Cache Management A large, monolithic physically shared on-chip cache does not scale beyond a small number of cores, and the only practical option is to physically distribute memory in pieces so that every core is near some portion of the cache [131. In theory this provides a large amount of aggregate cache capacity and fast private memory for each core. Unfortunately, it is difficult to manage distributed private caches effectively as they require architectural support for cache coherence and consistency. Popular directory-based protocols enable fast local caching to exploit data locality, but scale poorly with increasing core counts [6, 66]. Many recent proposals (e.g., Tagless Directory [96], SPACE [99], SPATL [100], SCD [801, In-Network Cache Coherence [29]) have addressed directory scalability in single-chip multicores using complex sharer compression techniques or on-chip network capabilities. Caches in state-of-the-art multicore processors are typically organized into multiple levels to take advantage of the multiple working sets in an application. The lower level(s) of the cache hierarchy (i.e., closer to the compute pipeline) have traditionally been private to each core while the the highest level a.k.a. last-level cache (LLC) has been diverse across production processors [13, 381. Different LLC organizations offer trade-offs between on-chip data locality and off-chip miss rate. While private LLC organizations (e.g., AMD Opteron [261, Knights Landing [4]) have low hit latencies, their off-chip miss rates are high in applications that exhibit high degrees of data sharing (due to cache line replication). In addition, private LLC organizations are inefficient when running multiple applications that have uneven distributions of working sets. Shared LLC organizations (e.g., [21), on the other hand, lead to non-uniform cache access latencies (NUCA) [50, 421 that hurt on-chip locality. However, they improve cache utilization and thereby, off-chip miss rates since cache lines are not replicated. Exploiting spatio-temporal locality in such organizations is challenging since cache latency is sensitive to data placement. To address this problem, coarse-grain data 22 placement, migration and restrictive replication schemes have been proposed (e.g., Victim Replication [97], Adaptive Selective Replication [10], Reactive-NUCA [39], CMP-NuRAPID [24]). These proposals attempt to combine the good characteristics of private and shared LLC organizations. All previous proposals assume private low-level caches and use snoopy/directorybased cache coherence to keep the private caches coherent. A request for data allocates and replicates a data block in the private cache hierarchy even if the data has no spatial or temporal locality. This leads to cache pollution since such low locality data can displace more frequently used data. Since on-chip wires are not scaling at the same rate as transistors [16], unnecessary data movement and replication not only impacts latency, but also consumes extra network and cache power [661. In addition, previous proposals for hybrid last-level cache (LLC) management either do not perform fine-grained adaptation of their policies to dynamic program behavior or replicate cache lines based on static policies without paying attention to their locality. 
In addition, some of them significantly complicate coherence (e.g., 124]) and do not scale to large core counts. 1.1.2 Single-core Performance In addition to maintaining multicore scalability, good single-core performance must be ensured to optimally utilize the functional units on each core and thereby, extract the best performance. Since the working sets of application rarely fit within the Li cache, ensuring good single-core performance requires the exploitation of the memory level parallelism (MLP) in an application. MLP can be exploited using out-of-order (000) processors through dynamically scheduling independent memory operations. With in-order processors, prefetching (both software/hardware) and static compiler scheduling techniques can be used to exploit MLP. However, it is worth noting that current industry trends are moving towards out-of-order processors in em- bedded (Atom [72], ARM [36]), server (Xeon [381, SPARC [11) and high-performance computing processors (Knights Landing [41). 23 1.2 Programmability and Memory Models In addition to maintaining good performance and energy efficiency, future multicore processors have to maintain ease of programming. The programming complexity is significantly affected by the memory consistency model of the processor. The memory model dictates the order in which the memory operations of one thread appear to another. The strongest memory model is the Sequential Consistency (SC) [591 model. SC mandates that the global memory order is an interleaving of the memory accesses of each thread with each thread's memory accesses appearing in program order in this global order. SC is the most intuitive model to the software developer and is the easiest to program and debug with. Production processors do not implement SC due to its negative performance impact. SPARC RMO [1], ARM [36] and IBM Power [65] processors implement relaxed (/weaker) memory models that allow reordering of load & store instructions with explicit fences for ordering when needed. These processors can better exploit memory level parallelism (MLP), but require careful programmer-directed insertion of memory fences to do so. Intel x86 [83j and SPARC [1] processors implement Total Store Order (TSO), which attempts to strike a balance between programmability and performance. The TSO model only relaxes the Store-+Load ordering of SC, and improves performance by enabling loads (that are crucial to performance) to bypass stores in the write buffer. Any new optimizations implemented for improving energy and performance must be compatible with the memory consistency model of the processor and all existing optimizations. Otherwise, the existing code can no longer be supported and programmers should undergo constant re-training to stay updated with the latest changes to the memory model. An alternate solution is to off-load the responsibility of adapting to memory model changes to the compiler while providing a constant model to the programmer. However, this requires automated insertion of fences which is still an active area of research [7]. 24 1.3 Thesis Contributions This thesis makes the following five principal contributions that holistically address performance, energy efficiency and programmability. 1. Proposes a scalable limited directory-based coherence protocol, ACKwise [56, 571 that reduces the directory storage needed to track the sharers of a data block. 2. 
Proposes a Locality-aware Private Cache Replication scheme [54] to better manage the private caches in multicore processors by intelligently controlling data caching and replication. 3. Proposes a Timestamp-based Memory Ordering Validation technique that enables the preservation of familiar memory consistency models when the intelligent private cache replication scheme is applied to state-of-the-art production processors. 4. Proposes a Locality-aware LLC Replication scheme [53] that better manages the last-level shared cache (LLC) in multicore processors by balancing shared data (and instruction) locality and off-chip miss rate through controlled replication. 5. Proposes a Locality-Aware Adaptive Cache Hierarchy Management Scheme that seamlessly combines all the above schemes to provide optimal data locality and miss rates at all levels of the cache hierarchy. These five contributions are briefly summarized below. 1.3.1 ACKwise Directory Coherence Protocol This thesis proposes the ACKwise limited-directory based coherence protocol. ACKwise, by using a limited set of hardware pointers to track the sharers of a cache line, incurs much less area and energy overhead than the conventional directory-based protocol. If the number of sharers exceeds the number of hardware pointers, ACKwise 25 does not track the identities of the sharers anymore. Instead, it tracks the number of sharers. On an exclusive request, an invalidation request is broadcast to all the cores. However, acknowledgements need to be sent by only the actual sharers since the number of sharers is tracked. The invalidation broadcast is handled efficiently by making simple changes to an electrical mesh network. The ACKwise protocol is advantageous because it achieves the performance of the full-map directory protocol while reducing the area and energy consumption dramatically. 1.3.2 Locality-aware Private Cache Replication [541 This thesis proposes a scalable, efficient protocol that better manages private caches by enabling seamless adaptation between private and logically shared caching at the fine granularity of cache lines. When a core makes a memory request that misses the private caches, the protocol either brings the entire cache line using a directory-based coherence protocol, or just accesses the requested word at the shared cache location using a roundtrip message over the network. The second style of data management is called "remote access". The decision is made based on the spatio-temporallocality of a particular data block. The locality of each cache line is profiled at runtime by measuring its reuse, i.e., the number of times the cache line is accessed before being removed from its private cache location. Only data blocks with high spatio-temporal locality (i.e., high reuse) are allowed to be privately cached. A low-overhead, highly accurate hardware predictor that tracks the locality of cache lines is proposed. The predictor takes advantage of the correlation exhibited between the reuse of multiple cores for a cache line. Only the locality information for a few cores per cache line is maintained and the locality information for others is predicted by taking a majority vote. This locality tracking mechanism is decoupled from the sharer tracking structures. Hence, this protocol can work with ACKwise or any other directory coherence protocol that enables scalable tracking of sharers. 
However, it is worth noting that the locality-aware protocol makes ACKwise even more efficient since it reduces the number of privately-cached copies of a data block. Locality-aware Private Cache Replication is advantageous because it: 26 1. Better exploits on-chip private cache capacity by intelligently controlling data caching and replication. 2. Lowers memory access latency by trading off unnecessary cache evictions or expensive invalidations with much cheaper word accesses. 3. Lowers energy consumption by better utilizing on-chip network and cache resources. 1.3.3 Timestamp-based Memory Ordering Validation This thesis explores how to extend the above locality-aware private cache replication scheme for processors with out-of-order memory (i.e., supporting multiple outstanding memory transactions using non-blocking caches) that employ popular memory models. The remote access required by the locality-aware coherence protocol is incompatible with the following two optimizations implemented in such processors to exploit memory level parallelism (MLP). (1) Speculative out-of-order execution is used to improve load performance, enabling loads to be issued and completed before previous load/fence operations. Memory consistency violations are detected when invalidations, updates or evictions are made to addresses in the load buffer. The pipeline state is rolled back if this situation arises. (2) Exclusive store prefetch [33] requests are used to improve store performance. These prefetch requests fetch the cache line into the L1-D cache and can be executed out-of-order and in parallel. However, the store requests must be issued and completed following the ordering constraints specified by the memory model. But, most store requests hit in the L1-D cache (due to the earlier prefetch request) and hence can be completed quickly. Remote access is incompatible due to the following two reasons. (1) A remote load does not create private cache copies of cache lines and hence, invalidation/ update requests cannot be used to detect memory consistency violations for speculatively executed operations. (2) A remote store also never creates private cache copies and hence, exclusive store prefetch requests cannot be employed to improve performance. In this thesis, a novel technique is proposed that uses timestamps to detect memory 27 consistency violations when speculatively executing loads under the locality-aware protocol. Each load and store operation is assigned an associated timestamp and a simple arithmetic check is done at commit time to ensure that memory consistency has not been violated. This technique does not rely on invalidation/update requests and hence is applicable to remote accesses. The timestamp mechanism is efficient due to the observation that consistency violations occur due to conflicting accesses that have temporal proximity (i.e., within a few cycles of each other), thus requiring timestamps to be stored only for a small time window. This technique works completely in hardware and requires only 2.5 KB of storage per core. This scheme guarantees forward progress and is starvation-free. An implementation of the technique for the Total Store Order (TSO) model is provided and adaptations to implement it on other memory models are discussed. 
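To make the commit-time check concrete, the following C++ sketch models a highly simplified version of this idea. It is illustrative only: the type names (HistoryTable, SpeculativeLoad) are invented for this example, and the actual mechanism described in Chapter 5 additionally handles finite history queues, in-flight transactions, remote accesses and parallel remote stores.

    // Minimal sketch (not the thesis implementation) of timestamp-based validation
    // for a speculatively executed load. All names here are hypothetical.
    #include <cstdint>
    #include <unordered_map>

    using Timestamp = uint64_t;
    using LineAddr  = uint64_t;

    // Records the timestamp of the last store to each cache line, but only for a
    // small recent window -- the key observation is that the conflicting accesses
    // that cause violations occur within a few cycles of each other.
    struct HistoryTable {
        std::unordered_map<LineAddr, Timestamp> last_store_ts;
        Timestamp window_start = 0;   // entries older than this have been discarded

        void record_store(LineAddr line, Timestamp now) { last_store_ts[line] = now; }

        Timestamp lookup(LineAddr line) const {
            auto it = last_store_ts.find(line);
            // If no store is recorded within the window, treat the last store as
            // happening no later than the window start (conservative for old loads).
            return (it != last_store_ts.end()) ? it->second : window_start;
        }
    };

    struct SpeculativeLoad {
        LineAddr line;
        Timestamp performed_at;   // when the load read its value, possibly early
    };

    // Commit-time check: the early load is safe if no conflicting store has been
    // observed after the load obtained its value; otherwise the pipeline rolls back.
    bool load_commit_is_safe(const SpeculativeLoad& ld, const HistoryTable& history) {
        return history.lookup(ld.line) <= ld.performed_at;
    }

A real implementation squashes and re-executes the load when this check fails; bounding the retained history window is what keeps the storage requirement to a few kilobytes per core, as quoted above.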
1.3.4 Locality-aware LLC Replication [531 This thesis proposes a data replication mechanism for the LLC that retains the onchip cache utilization of the shared LLC while intelligently replicating cache lines close to the requesting cores so as to maximize data locality. The proposed scheme only replicates those cache lines that demonstrate reuse at the LLC while bypassing the replication overheads for the other cache lines. To achieve this goal, a low-overhead yet highly accurate in-hardware locality classifier (14.5 KB storage per core) is proposed that operates at the cache line granularity and only allows the replication of cache lines with high spatio-temporal locality (similar to the earlier private cache replication technique). This classifier captures the LLC pressure and adapts its replication decision accordingly. This data replication mechanism does not involve "remote access" and hence, does not require special support for adhering to a memory consistency model. Locality-aware LLC Replication is advantageous because it: 1. Lowers memory access latency and energy by selectively replicating cache lines that show high reuse in the LLC slice of the requesting core. 28 2. Better exploits the LLC by balancing the off-chip miss rate and on-chip locality using a classifier that adapts to the run-time reuse at the granularity of cache lines. 3. Allows coherence complexity almost identical to that of a traditional nonhierarchical (flat) coherence protocol since replicas are only allowed to be placed at the LLC slice of the requesting core. The additional coherence complexity only arises within a core when the LLC slice is searched on a private cache miss, or when a cache line in the core's local cache hierarchy is evicted/ invalidated. 1.3.5 Locality-aware Cache Hierarchy Replication This thesis combines the private cache replication and LLC replication schemes discussed in Chapters 4 & 6 with the timestamp-based memory ordering validation technique presented in Chapter 5 into a combined cache hierarchy replication scheme. The design of this scheme is motivated by the experimental observation that both localityaware private cache & LLC replication enable varying performance improvements for benchmarks. Certain benchmarks exhibit improvement only with intelligent private cache replication, certain others exhibit improvement only with locality-aware LLC replication, while certain benchmarks exhibit improvement with both locality-aware private cache & LLC replication. This necessitates the design of a combined replication scheme that exploits the benefits of both the schemes introduced earlier. 1.3.6 Results Summary On a 64-core multicore processor with out-of-order cores, Locality-aware Cache Hierarchy Replication improves completion time by 15% and energy by 22% while incurring a storage overhead of 30.5 KB per core (i.e., 10% the aggregate cache capacity of each core). Locality-aware Private Cache Replication alone improves completion time by 13% and energy by 15% while incurring a storage overhead of 21.5 KB per core. Locality-aware LLC Replication improves completion time by 10% and energy by 15% while incurring a storage overhead of 14.5 KB per core. Note that in the 29 evaluated system, the Li cache is the only private cache level and the L2 cache is the last-level cache (LLC). 1.4 Organization of Thesis The rest of this thesis is organized as follows. Chapter 2 describes the evaluation methodology, including the baseline system. 
Chapter 3 describes the ACKwise directory coherence protocol. Chapter 4 describes the locality-aware private cache replication scheme. Chapter 5 describes the memory consistency implications of the localityaware scheme. It proposes a novel timestamp-based technique to enable the implementation of the locality-aware scheme in state-of-the-art processors while preserving their memory consistency model. Chapter 6 describes the locality-aware adaptive & LLC replication scheme. Chapter 7 combines the schemes developed in Chapters 4 6 along with the mechanism to detect memory consistency violations in Chapter 5 to propose a locality-aware cache hierarchy management scheme. Chapter 8 describes the related work in detail and compares and contrasts that against the work described in this thesis. Finally, Chapter 9 concludes the thesis. 30 Chapter 2 Evaluation Methodology Baseline Architecture core / 2.1 Compute Pipeline Li I-Cache Li D-Cache L2 Shared Cache Directory Router Figure 2-1: Architecture of the baseline system. Each core consists of a compute pipeline, private Li instruction and data caches, a physically distributed shared L2 cache with integrated directory and a network router. The baseline system is a tiled multicore with an electrical 2-D mesh interconnection network as shown in Figure 2-1. Each core consists of a compute pipeline, private Li instruction and data caches, a physically distributed shared L2 cache with integrated directory, and a network router. The coherence directory is integrated with the L2 slices by extending the L2 tag arrays (in-cache directory organization [18, 13]) and tracks the sharing status of the cache lines in the per-core private Li caches. The 31 private caches are kept coherent using a full-map directory-based coherence protocol. Some cores along with the periphery of the electrical mesh have a connection to a memory controller as well. The mesh network uses dimension-order X-Y routing and wormhole flow control. The shared L2 cache is managed using the data placement, replication and migrations mechanisms of Reactive-NUCA [39] as follows. Private data is placed at the L2 slice of the requesting core, shared data is interleaved at the OS page granularity across all L2 slices, and instructions are replicated at a single L2 slice for every cluster of 4 cores using a rotational interleaving mechanism. Classification of data into private and shared is done at the OS page granularity by augmenting the existing page table and TLB (Translation Lookaside Buffer) mechanisms. We evaluate both 64-core and 256-core multicore processors. The important architectural parameters used for evaluation are shown in Table 2.1. Both in-order and out-of-order cores are evaluated (the parameters for each are shown in Table 2.1). 2.2 Performance Models All experiments are performed using the core, cache hierarchy, coherence protocol, memory system and on-chip interconnection network models implemented within the Graphite [68] multicore simulator. The Graphite simulator requires the memory system (including the cache hierarchy) to be functionally correct to complete simulation. This is a good test that all our cache coherence protocols are working correctly given that we have run 27 benchmarks to completion. The Nehalem decoder for the outof-order core is borrowed from the ZSim simulator [811. The electrical mesh interconnection network uses X-Y routing. 
Since modern network-on-chip routers are pipelined [27], and 2- or even 1-cycle per hop router latencies [52] have been demonstrated, we model a 2-cycle per hop delay; we also account for the appropriate pipeline latencies associated with loading and unloading a packet onto the network. In addition to the fixed per-hop latency, network link contention delays are also modeled.

Architectural Parameter                Value

Number of Cores                        64 & 256
Clock Frequency                        1 GHz
Processor Word Size                    64 bits
Physical Address Length                48 bits

In-Order Core
  Issue Width                          1

Out-of-Order Core
  Issue Width                          1
  Reorder Buffer Size                  168
  Load Queue Size                      64
  Store Queue Size                     48

Memory Subsystem
  L1-I Cache per core                  16 KB, 4-way Assoc., 1 cycle
  L1-D Cache per core                  32 KB, 4-way Assoc., 1 cycle
  L2 Cache (LLC) per core              256 KB, 8-way Assoc., 7 cycle
                                       (2 cycle tag, 4 cycle data),
                                       Inclusive, R-NUCA [39]
  Cache Line Size                      64 bytes
  Directory Protocol                   Invalidation-based MESI
  Num. of Memory Controllers           8
  DRAM Bandwidth                       5 GBps per Controller
  DRAM Latency                         75 ns

Electrical 2-D Mesh with XY Routing
  Hop Latency                          2 cycles (1-router, 1-link)
  Flit Width                           64 bits
  Header (Src, Dest, Addr, MsgType)    1 flit
  Word Length                          1 flit (64 bits)
  Cache Line Length                    8 flits (512 bits)

Table 2.1: Architectural parameters used for evaluation

2.3 Energy Models

For energy evaluations of on-chip electrical network routers and links, we use the DSENT [87] tool. Energy estimates for the L1-I, L1-D and L2 (with integrated directory) caches as well as DRAM are obtained using McPAT [60]. When calculating the energy consumption of the L2 cache, we assume a word-addressable cache architecture. We model the dynamic energy consumption of both the word access and the cache line access in the L2 cache using McPAT.
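As a concrete illustration of how these per-event energies are combined with simulator statistics (the toolflow described in Section 2.4), the following C++ sketch computes a dynamic energy total. The component names and all numeric values are placeholders, not the actual characterization data produced by McPAT or DSENT.

    // Minimal sketch: combining per-event energies with event counts to estimate
    // dynamic energy. All components and numbers below are illustrative placeholders.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <string>

    int main() {
        // Hypothetical per-event dynamic energies (picojoules), as McPAT/DSENT
        // might report for a given technology node and configuration.
        std::map<std::string, double> energy_per_event_pJ = {
            {"l1d_access",      10.0},
            {"l2_word_access",  25.0},   // word access vs. full cache line access
            {"l2_line_access", 180.0},   // are modeled separately, as noted above
            {"router_traversal", 8.0},
            {"link_traversal",   5.0},
        };
        // Hypothetical event counters, as a simulator such as Graphite might emit.
        std::map<std::string, uint64_t> event_counts = {
            {"l1d_access",      1500000},
            {"l2_word_access",    50000},
            {"l2_line_access",    30000},
            {"router_traversal", 250000},
            {"link_traversal",   250000},
        };
        double total_pJ = 0.0;
        for (const auto& [event, count] : event_counts)
            total_pJ += static_cast<double>(count) * energy_per_event_pJ.at(event);
        std::printf("Dynamic energy estimate: %.3f uJ\n", total_pJ * 1e-6);
        return 0;
    }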
2.5 Application Benchmarks We simulate SPLASH-2 [93] benchmarks, PARSEC [14] benchmarks, Parallel-MIBench [431, a Travelling-Salesman-Problem (DFs) benchmark, a Matrix-Multiply marks (CONNECTED-COMPONENTS & (TSP) (MATMUL) benchmark, a Depth-First-Search benchmark, and two graph bench- COMMUNITY-DETECTION) 34 [31 using the Graphite Problem Size Application SPLASH-2 [93] 4M integers, radix 1024 1024 x 1024 matrix, 16 x 16 blocks 2050 x 2050 ocean 1026 x 1026 ocean tk29.0 64K particles 258 x 258 ocean 512 molecules 512 molecules RADIX LU-C, LU-NC OCEAN-C OCEAN-NC CHOLESKY BARNES OCEAN WATER-NSQUARED WATER-SPATIAL RAYTRACE car VOLREND head PARSEC [14] BLACKSCHOLES 64K options SWAPTIONS 64 swaptions, 20,000 sims. 1 STREAMCLUSTER 16384 points per block, DEDUP 31 MB data block FERRET 256 queries, 34,973 images BODYTRACK 4 frames, 4000 particles FLUIDANIMATE 5 CANNEAL 200,000 elements FACESIM 1 frame, 372,126 tetrahedrons frames, 100,000 particles Parallel MI Bench [43] DIJKSTRA-SINGLE-SOURCE Graph with 4096 nodes DIJKSTRA-ALL-PAIRS Graph with 512 nodes PATRICIA SUSAN 5000 IP address queries PGM picture 2.8 MB UHPC [3] CONNECTED-COMPONENTS Graph with 218 nodes COMMUNITY-DETECTION Graph with 216 nodes Others TSP 16 cities MATRIX-MULTIPLY 1024 x 1024 matrix DFS Graph with 876800 nodes Table 2.3: Problem sizes for our parallel benchmarks. 35 multicore simulator. The graph benchmarks model social networking based applications. The problem sizes for each application are shown in Table 2.3. 36 Chapter 3 ACKwise Directory Coherence Protocol This chapter presents A CKwise, a directory coherence protocol derived from an MSI directory based protocol. Each directory entry in this protocol, as shown in Figure 3-1 is similar to one used in a limited directory organization [6] and contains the following 3 fields: (1) State: This field specifies the state of the cached block (one of the MSI states); (2) Global(G): This field states whether the number of sharers for this data block exceeds the capacity of the sharer list. If so, a broadcast is needed to invalidate all the cached blocks corresponding to this address when a cache demands exclusive ownership; (3) Sharers 1,_: This field represents the sharer list. It can track upto p distinct core IDs. G Core IDTCrag D Figure 3-1: Structure of an ACKwisep coherence directory entry ACKwise operates similar to a full-map directory protocol when the number of sharers is less than or equal to the number of hardware pointers (p) in the limited directory. When the number of sharers exceeds the number of hardware pointers (p), the Global(G) bit is set to true so that any number of sharers beyond this point can 37 be accommodated. Once the global (G) bit is set to true, the sharer list (Sharers,p) just holds the total number of sharers of this data block. 3.1 Protocol Operation When a request for a shared copy of a data block is issued, the directory controller first checks the state of the data block in the directory cache. (a) If the state is Invalid(I), it forwards the request to the memory controller. The memory controller fetches the data block from memory and sends it directly to the requester. It also sends an acknowledgement to the directory. The directory changes the state of the data block to Shared(S). (b) If the state is Shared(S), the data is fetched from the L2 cache and forwarded to the requester. (c) If the state is Modified(M), the data is fetched from the Li cache of the owner and forwarded to the requester and the state is set to Shared(S). 
In all the above cases, the directory controller also tries to add the ID of the requester to the sharer list. This is straightforward if the Global (G) bit is clear and the sharer list has vacant spots. If the Global (G) bit is clear but the sharer list is full, it sets the Global (G) bit to true and stores the total number of sharers in the sharer list. If the Global (G) bit is already set to true, then it increments the number of sharers by one.

When a request for an exclusive copy of a data block is issued, the directory controller first checks the state of the data block in the directory cache. (a) If the state is Invalid (I), the sequence of actions followed is the same as that for the read request except that the state of the data block in the directory is set to Modified (M) instead of Shared (S). (b) If the state is Shared (S), then the actions performed by the directory controller are dependent on the state of the Global (G) bit. If the Global (G) bit is clear, it unicasts invalidation messages to each core in the sharer list. Else, if the Global (G) bit is set, it broadcasts an invalidation message. The sharers invalidate their cache blocks and acknowledge the directory. The directory controller expects as many acknowledgements as the number of sharers (encoded in the sharer list if the Global (G) bit is set and calculated directly if the Global (G) bit is clear). After all the acknowledgements are received, the directory controller sets the state of the data block to Modified (M) and the Global (G) bit to false. (c) If the state is Modified (M), the directory flushes the owner and forwards the data to the requester. In all cases, the requester is added to the sharer list as well.

3.2 Silent Evictions

Silent evictions of cache lines cannot be supported by this protocol since the exact number of sharers always has to be maintained. However, since evictions are not on the critical path, they do not hurt performance other than by increasing network contention delays negligibly. Since evictions do not contain data, the network energy overhead is also very small. In a 1024-core processor, performance is reduced by only 1% and the total number of network flits sent is reduced by only 3% if silent evictions are used.

Figure 3-2: Broadcast routing on a mesh network from core B. X-Y dimension-order routing is followed.

3.3 Electrical Mesh Support for Broadcast

The mesh has to be augmented with native hardware support for broadcasts. The broadcast is carried out by configuring each router to selectively replicate a broadcast message on its output links. Dimension-order X-Y routing is followed. Figure 3-2 illustrates how a broadcast is carried out starting from core B.

Even though unicasts with dimension-order X-Y routing do not cause network deadlocks, broadcasts with the same routing scheme lead to deadlocks (assuming no virtual channels). This can be illustrated using the example in Figure 3-3.

Figure 3-3: Illustration of deadlock when using broadcasts on a 1-dimensional mesh network with X-Y routing.

There are four broadcast packets and two routers. Packet-1 has three flits (1H, 1B, 1T), packet-2 has three flits (2H, 2B, 2T), packet-3 has two flits (3H, 3T) and packet-4 has two flits (4H, 4T). Assume that the mesh is 1-dimensional for simplicity. Packet-1 wants to move to Router-B and is stuck behind Packet-4. Packet-4 wants to move to Core-B (i.e., the core attached to Router-B) and is waiting for Packet-2 to finish transmitting. Packet-2 wants to move to Router-A and is stuck behind Packet-3.
And Packet-3 wants to move to Core-A and is waiting for Packet-1 to finish transmitting, thereby completing the circular dependency.

Virtual channels can be used to avoid these deadlocks (e.g., virtual circuit multicasting [47]). Without virtual channels, the broadcast packets have to be handled using virtual cut-through flow control to ensure forward progress. With virtual cut-through, Packet-1 would completely reside in Router-A and Packet-2 in Router-B, thereby removing the circular dependency. Unicast packets can still use wormhole flow control. Using two flow control techniques in the same network, one for broadcast and another for unicast, increases complexity. Also, virtual cut-through flow control places restrictions on the number of flits in each input port flit-buffer. To avoid such complications, the size of each broadcast packet is restricted to 1 flit. This restriction is entirely reasonable since with a 64-bit flit, 48 bits can be allocated for the physical address and 10 bits for the sender core ID (assuming a 1024-core processor). The remaining 6 bits (48 + 10 + 6 = 64) suffice for storing the invalidation message type. This restriction allows wormhole flow control to be used for all packets. The deadlock and its avoidance mechanism have been simulated using a cycle-level version of Graphite employing finite buffer models in the network and were found to operate as described above.

3.4 Evaluation Methodology

We evaluate a 256-core shared memory multicore using out-of-order cores. The default architectural parameters used for evaluation are shown in Table 2.1. The mesh network is equipped with broadcast support as described unless otherwise mentioned.

3.5 Results

In this section, the performance and energy efficiency of ACKwise is evaluated. First, the sensitivity of ACKwise to the number of hardware pointers k is evaluated. Next, its sensitivity to broadcast support is evaluated. And finally, ACKwise_k is compared to an alternate limited directory-based protocol, Dir_kNB [6].

3.5.1 Sensitivity to Number of Hardware Pointers (k)

Here, the sensitivity of the ACKwise_k protocol to the number of hardware pointers, k, is evaluated. The value k is varied from 1 to 256 in the following order: 1, 2, 4, 8, 16 & 256. The value 256 corresponds to the full-map directory organization, which allocates a bit to track the sharing status of every core in the system. The other values of k correspond to ACKwise with different numbers of hardware pointers. Each hardware pointer is 8 bits long (since log2(256) = 8). So, with k = 1, 2, 4, 8, 16 & 256, the number of bits per cache line allocated to sharer tracking is 8, 16, 32, 64, 128 & 256 respectively.

Figure 3-4: Completion time of the ACKwise_k protocol when k is varied as 1, 2, 4, 8, 16 & 256. Results are normalized to the full-map protocol (k = 256).
Figure 3-4 plots the completion time of the ACKwise_k protocol as k is varied, normalized to the full-map protocol (k = 256). The completion time is broken down into the following categories: Instructions, Branch Speculation, L1-I Fetch Stalls (stall time due to instruction cache misses), Compute Stalls (stall time waiting for functional units such as the ALU and FPU), Memory Stalls (stall time due to load-store queue capacity limits and waiting for memory), Synchronization, and Idle time (time spent waiting for a thread to be spawned). The completion times of the ACKwise_k variants are close to that of the full-map directory; the memory stall time increases slightly at low values of k due to the invalidation broadcast requests created as a result of the limited directory size. The number of acknowledgement responses sent by ACKwise_k, on the other hand, is independent of the value of k and is equal to that sent by the full-map directory (since ACKwise_k tracks the exact number of sharers). Overall, the ACKwise_4 protocol is better than the other variants by ~0.5%.

Figure 3-5: Energy of the ACKwise_k protocol when k is varied as 1, 2, 4, 8, 16 & 256. Results are normalized to the full-map protocol (k = 256).

Figure 3-5 plots the energy as a function of k. Initially, energy is found to reduce as k is increased. This is due to the reduction in network traffic obtained when the number of broadcasts is reduced. However, as k increases further, the directory size increases to track more sharers. This increases the overall energy consumption. The full-map protocol exhibits the largest directory size, and hence the highest directory component of energy. Overall, the ACKwise_2, ACKwise_4 and ACKwise_8 protocols show a 3.5% energy savings over Full-Map, while the ACKwise protocol only shows a 1.5% improvement.

Based on the performance & energy results, ACKwise_4 (with 4 hardware pointers) is chosen as the optimal protocol going ahead even though other variants of ACKwise show similar trends. The ACKwise_4 protocol also reduces the area overhead for tracking sharers from 128 KB per core for the full-map directory to a mere 16 KB (with 4096 directory entries per core, a 256-bit full-map sharer vector amounts to 128 KB, while four 8-bit pointers amount to 16 KB).

3.5.2 Sensitivity to Broadcast Support

Here, the sensitivity of the ACKwise protocol to broadcast support is evaluated. This is done by comparing the performance and energy efficiency of ACKwise_k on a mesh without broadcast support to the previously evaluated scheme (with broadcast support). In the absence of specialized broadcast support, a broadcast is handled by sending separate unicast messages to all the cores in the system. These unicast messages have to be serialized through the network interface of the sender. This increases their latency and consumes on-chip network bandwidth.

Figure 3-6: Completion time of the ACKwise_k protocol when run on a mesh without broadcast support. In the figure, NB is used to indicate the absence of broadcast support. The results are normalized to the ACKwise_4 protocol on a broadcast-enabled mesh.

The completion time is high at low values of k. This is due to the large number of broadcasts that cause high invalidation latencies and network traffic disruption. This is evidenced by the increase in memory stall time. The synchronization time also increases if memory stalls occur in the critical regions. As the value of k increases, the completion time reduces due to a reduction in the number of broadcasts.

3.5.3 Comparison to Dir_kNB [6] Organization

Here, ACKwise is compared to an alternate limited-directory organization called Dir_kNB.
Dir_kNB only tracks the unique identities of k sharers, like ACKwise_k. However, if the number of sharers exceeds the number of hardware pointers k, an existing sharer is invalidated to make space for the new sharer. This technique always ensures that all the sharers of a cache line can be accommodated within the hardware pointers, thereby not requiring support for broadcasts. However, this organization does not work well for cache lines that are widely shared with a significant number of read accesses, since Dir_kNB can only accommodate a maximum of k readers before invalidating other readers.

Figure 3-7: Completion time of the Dir_kNB protocol when k is varied as 2, 4, 8 & 16. Results are normalized to the ACKwise_4 protocol.

Figure 3-7 plots the completion time of Dir_kNB as a function of k. Results are normalized to the ACKwise_4 protocol. The value k is varied from 2 to 16 in powers of 2. The Dir_kNB protocol performs poorly and exhibits high L1-I (instruction) cache stall time due to the wide sharing of instructions between program threads. Memory stalls are found to be significant in the RADIX benchmark as well due to the presence of widely-shared read-mostly data. Completion time reduces as k is increased since more sharers can be accommodated before the need for invalidation. However, the ACKwise_4 protocol performs at least an order of magnitude better than all Dir_kNB variants.

3.6 Summary

This chapter proposes the ACKwise limited-directory-based coherence protocol. ACKwise, by using a limited set of hardware pointers to track the sharers of a cache line, incurs much less area and energy overhead than the conventional directory-based protocol. If the number of sharers exceeds the number of hardware pointers, ACKwise does not track the identities of the sharers anymore. Instead, it tracks the number of sharers. On an exclusive request, an invalidation request is broadcast to all the cores. However, acknowledgements need to be sent by only the actual sharers since the number of sharers is tracked. The invalidation broadcast is handled efficiently by making simple changes to an electrical mesh network. Evaluations on a 256-core multicore show that ACKwise_4 incurs only 16 KB storage per core compared to the 128 KB storage required for the full-map directory. ACKwise_4 also matches the performance of the full-map directory while expending 3.5% less energy.

Chapter 4

Locality-aware Private Cache Replication

4.1 Motivation

First, the need for a locality-aware allocation of cache lines in the private caches of a shared memory multicore processor is motivated. The locality of each cache line is quantified using a utilization metric. The utilization is defined as the number of accesses that are made to the cache line by a core after being brought into its private cache hierarchy and before being invalidated or evicted.

Figure 4-1: Invalidations vs Utilization.

Figure 4-2: Evictions vs Utilization.

Figures 4-1 and 4-2 show the percentage of invalidated and evicted cache lines as a function of their utilization.
We observe that many cache lines that are evicted or invalidated in the private caches exhibit low locality (e.g., in STREAMCLUSTER, 80% of the cache lines that are invalidated have utilization < 4). To avoid the performance penalties of invalidations and evictions, we propose to only bring those cache lines that have high spatio-temporal locality into the private caches and not replicate those with low locality. This is accomplished by tracking vital statistics at the private caches and the on-chip directory to quantify the utilization of data at the granularity of cache lines. This utilization information is subsequently used to classify data as cache line or word accessible.

4.2 Protocol Operation

We first define a few terms to facilitate describing our protocol.

- Private Sharer: A private sharer is a core which is granted a private copy of a cache line in its L1 cache.
- Remote Sharer: A remote sharer is a core which is NOT granted a private copy of a cache line. Instead, its L1 cache miss is handled at the shared L2 cache location using word access.
- Private Utilization: Private utilization is the number of times a cache line is used (read or written) by a core in its private L1 cache before it gets invalidated or evicted.
- Remote Utilization: Remote utilization is the number of times a cache line is used (read or written) by a core at the shared L2 cache before it is brought into its L1 cache or gets written to by another core.
- Private Caching Threshold (PCT): The private utilization above or equal to which a core is "promoted" to be a private sharer, and below which a core is "demoted" to be a remote sharer of a cache line.

Figure 4-3: (1), (2) and (3) are mockup requests showing the two modes of accessing on-chip caches using our locality-aware protocol. Since the black data block has high locality with respect to the core at T, the directory at the home-node hands out a private copy of the cache line. On the other hand, the low-locality red data block is always cached in a single location at its home-node, and all requests ((2), (3)) are serviced using roundtrip remote-word accesses.

Note that a cache line can have both private and remote sharers. We first describe the basic operation of the protocol. Later, we present heuristics that are essential for a cost-efficient hardware implementation. Our protocol starts out as a conventional directory protocol and initializes all cores as private sharers of all cache lines (as shown by Initial in Figure 4-4).

Figure 4-4: Each cache line is initialized to Private with respect to all sharers. Based on the utilization counters that are updated on each memory access to this cache line and the parameter PCT, the sharers are transitioned between Private and Remote modes. Here utilization = (private + remote) utilization.

Let us understand the handling of read and write requests under this protocol.

4.2.1 Read Requests

When a core makes a read request and misses in its private L1 cache, the request is sent to the L2 cache. If the cache line is not present in the L2 cache, it is brought in from off-chip memory. The L2 cache then hands out a private read-only copy of the cache line if the core is marked as a private sharer in its integrated directory ((1) in Figure 4-3). (Note that when a cache line is brought in from off-chip memory, all cores start out as private sharers.)
The core then tracks the locality of the cache line by initializing a private utilization counter in its L1 cache to 1 and incrementing this counter for every subsequent read. Each cache line tag is extended with utilization tracking bits for this purpose as shown in Figure 4-5.

Figure 4-5: Each L1 cache tag is extended to include additional bits for tracking (a) private utilization, and (b) last-access time of the cache line.

On the other hand, if the core is marked as a remote sharer, the integrated directory either increments a core-specific remote utilization counter or resets it to 1 based on the outcome of a Timestamp check (that is described below). If the remote utilization counter has reached PCT, the requesting core is promoted, i.e., marked as a private sharer, and a copy of the cache line is handed over to it (as shown in Figure 4-4). Otherwise, the L2 cache replies with the requested word ((2) and (3) in Figure 4-3).

The Timestamp check that must be satisfied to increment the remote utilization counter is as follows: the last-access time of the cache line in the L2 cache is greater than the minimum of the last-access times of all valid cache lines in the same set of the requesting core's L1 cache. Note that if at least one cache line is invalid in the L1 cache, the above condition is trivially true. Each directory entry is augmented with a per-core remote utilization counter and a last-access timestamp (64 bits wide) for this purpose as shown in Figure 4-6. Each L1 cache tag also contains a last-access timestamp (shown in Figure 4-5) and this information is used to calculate the above minimum last-access time in the L1 cache. This minimum is then communicated to the L2 cache on an L1 miss.

Figure 4-6: ACKwise_p - Complete classifier directory entry. The directory entry contains the state, tag and ACKwise_p pointers as well as (a) mode (P/R), (b) remote utilization counters and (c) last-access timestamps for tracking the locality of all the cores in the system.

The above Timestamp check is added so that when a cache line is brought into the L1 cache, other cache lines that are equally or better utilized are not evicted, i.e., the cache line does not cause L1 cache pollution. For example, consider a benchmark that is looping through a data structure with low locality. Applying the above Timestamp check allows the system to keep a subset of the working set in the L1 cache. Without the Timestamp check, a remote sharer would be promoted to be a private sharer after a few accesses (even if the other lines in the L1 cache are well utilized). This would result in cache lines evicting each other and ping-ponging between the L1 and L2 caches.

4.2.2 Write Requests

When a core makes a write request that misses in its private L1 cache, the request is sent to the L2 cache. The directory performs the following actions if the core is marked as a private sharer: (1) it invalidates all the private sharers of the cache line, (2) it sets the remote utilization counters of all its remote sharers to '0', and (3) it hands out a private read-write copy of the line to the requesting core. The core then tracks the locality of the cache line by initializing the private utilization counter in its L1 cache to 1 and incrementing this counter on every subsequent read/write request.
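The same Timestamp check governs the remote-sharer write path described next. Before continuing, a minimal sketch of the check and of the promote-or-reset decision is given below; the structure and function names are hypothetical, and the real hardware operates on cache-set metadata rather than explicit in-memory structures.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Per-core, per-line locality state kept at the directory (illustrative).
    struct RemoteSharerState {
        uint32_t remoteUtilization = 0;   // accesses served at the shared L2
    };

    // Timestamp check: increment the remote utilization counter only if the L2
    // copy's last-access time exceeds the minimum last-access time of all valid
    // lines in the corresponding set of the requesting core's L1 cache (this
    // minimum is communicated with the L1 miss). Otherwise a later promotion
    // could evict an equally or better utilized L1 line.
    bool timestampCheckPasses(uint64_t l2LastAccessTime,
                              const std::vector<uint64_t>& l1SetLastAccessTimes,
                              bool l1SetHasInvalidLine) {
        if (l1SetHasInvalidLine) return true;          // trivially passes
        uint64_t l1SetMin = *std::min_element(l1SetLastAccessTimes.begin(),
                                              l1SetLastAccessTimes.end());
        return l2LastAccessTime > l1SetMin;
    }

    // Returns true if the requesting core should be promoted to a private sharer.
    bool updateRemoteSharer(RemoteSharerState& s, bool checkPassed,
                            uint32_t privateCachingThreshold /* PCT */) {
        if (checkPassed) s.remoteUtilization++;
        else             s.remoteUtilization = 1;
        return s.remoteUtilization >= privateCachingThreshold;
    }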
On the other hand, if the core is marked as a remote sharer, the directory performs the following actions: (1) it invalidates all the private sharers, (2) it sets the remote utilization counters of all remote sharers other than the requesting core to '0', and (3) it increments the remote utilization counter for the requesting core, or resets it to 1, using the same Timestamp check as described earlier for read requests. If the utilization counter has reached PCT, the requesting core is promoted and a private read-write copy of the cache line is handed over to it. Otherwise, the word to be written is stored in the L2 cache.

When a core writes to a cache line, the utilization counters of all remote sharers (other than the writer itself) must be set to '0' since they have been unable to demonstrate enough utilization to be promoted. All remote sharers must now build up utilization again to be promoted.

4.2.3 Evictions and Invalidations

When the cache line is removed from the private L1 cache due to eviction (conflict or capacity miss) or invalidation (exclusive request by another core), the private utilization counter is communicated to the directory along with the acknowledgement message. The directory uses this information along with the remote utilization counter present locally to classify the core as a private or remote sharer in order to handle future requests. If the (private + remote) utilization is >= PCT, the core stays as a private sharer, else it is demoted to a remote sharer (as shown in Figure 4-4). The remote utilization is added because if the cache line had been brought into the private L1 cache at the time its remote utilization was reset to 1, it would not have been evicted (due to the Timestamp check and the LRU replacement policy of the L1 cache) or invalidated any earlier. Therefore, the actual utilization observed during this classification phase includes both the private and remote utilization.

Performing classification using the mechanisms described above is expensive due to the area overhead, both at the L1 cache and the directory. Each L1 cache tag needs to store the private utilization and last-access timestamp (a 64-bit field). And each directory entry needs to track locality information (i.e., mode, remote utilization and last-access timestamp) for all the cores. We now describe heuristics to facilitate a cost-effective hardware implementation. In Section 4.3, we remove the need for tracking the last-access time from the L1 cache and the directory. The basic idea is to approximate the outcome of the Timestamp check by using a threshold higher than PCT for switching from remote to private mode. This threshold, termed Remote Access Threshold (RAT), is dynamically learned by observing the L1 cache set pressure and switches between multiple levels so as to optimize energy and performance. In Section 4.4, we describe a mechanism for predicting the mode (private/remote) of a core by tracking locality information for only a limited number of cores at the directory.

4.3 Predicting Remote-Private Transitions

The Timestamp check described earlier served to preserve well-utilized lines in the L1 cache. We approximate this mechanism by making the following two changes to the protocol: (1) de-coupling the threshold for remote-to-private mode transition from that for private-to-remote transition, and (2) dynamically adjusting this threshold based on the observed L1 cache set pressure.
The threshold for remote-to-private mode transition, i.e., the number of accesses at which a core transitions from a remote to a private sharer, is termed Remote Access Threshold (RAT). Initially, RAT is set equal to PCT (the threshold for private-to-remote mode transition). On an invalidation, if the core is classified as a remote sharer, RAT is unchanged. This is because the cache set has an invalid line immediately following an invalidation, leading to low cache set pressure. Hence, we can assume that the Timestamp check trivially passes on every remote access.

However, on an eviction, if the core is demoted to a remote sharer, RAT is increased to a higher level. This is because an eviction signifies higher cache set pressure. By increasing RAT to a higher level, it becomes harder for the core to be promoted to a private sharer, thereby counteracting the cache set pressure. If there are back-to-back evictions, with the core demoted to a remote sharer on each of them, RAT is further increased to higher levels. However, RAT is not increased beyond a certain value (RAT_max) due to the following two reasons: (1) the core should be able to return to the status of a private sharer if it later shows good locality, and (2) the number of bits needed to track remote utilization should not be too high. Also, beyond a particular RAT, keeping the core as a remote sharer counteracts the increased cache pressure negligibly, leading to only small improvements in performance and energy. The protocol is also equipped with a short-cut in case an invalid cache line exists in the L1 cache. In this case, if remote utilization reaches or rises above PCT, the requesting core is promoted to a private sharer since it will not cause cache pollution.

The number of RAT levels used is abbreviated as nRAT_levels. RAT is additively increased in equal steps from PCT to RAT_max, the number of steps being equal to (nRAT_levels - 1). On the other hand, if the core is classified as a private sharer on an eviction or invalidation, RAT is reset to its starting value of PCT. Doing this is essential because it provides the core the opportunity to re-learn its classification.

Varying the RAT in this manner removes the need to track the last-access time both in the L1 cache tag and the directory. However, a field that identifies the current RAT-level now needs to be added to each directory entry. These bits replace the last-access timestamp field in Figure 4-6. The efficacy of this scheme is evaluated in Section 4.11.3. Based on our observations, RAT_max = 16 and nRAT_levels = 2 were found to produce results that closely match those produced by the Timestamp-based classification scheme.

Figure 4-7: The limited locality classifier extends the directory entry with mode, utilization, and RAT-level bits for a limited number of cores. A majority vote of the modes of tracked cores is used to classify new cores as private or remote sharers.

4.4 Limited Locality Classifier

The classifier described earlier, which keeps track of locality information for all cores in the directory entry, is termed the Complete locality classifier. It has a storage overhead of 60% (calculated in Section 6.2.3) at 64 cores and over 10x at 1024 cores. In order to mitigate this overhead, we develop a classifier that maintains locality information for a limited number of cores and classifies the other cores as private or remote sharers based on this information.
The locality information for each core consists of (1) the core ID, (2) a mode bit (P/R), (3) a remote utilization counter, and (4) a RAT-level. The classifier that maintains a list of this information for a limited number of cores (k) is termed the Limited_k classifier. Figure 4-7 shows the information that is tracked by this classifier. The sharer list of ACKwise is not reused for tracking locality information because of its different functionality. While the hardware pointers of ACKwise are used to maintain coherence, the limited locality list serves to classify cores as private or remote sharers. Decoupling in this manner also enables the locality-aware protocol to be implemented efficiently on top of other scalable directory organizations. However, the locality-aware protocol enables ACKwise to be implemented more efficiently, as will be described in Section 4.8.

We now describe the working of the limited locality classifier. At startup, all entries in the limited locality list are free, and this is denoted by marking all core IDs as INVALID. When a core makes a request to the L2 cache, the directory first checks if the core is already being tracked by the limited locality list. If so, the actions described in Section 4.2 are carried out. Else, the directory checks if a free entry exists. If it does exist, it allocates the entry to the core and the actions described in Section 4.2 are carried out. Otherwise, the directory checks if a currently tracked core can be replaced. An ideal candidate for replacement is a core that is currently not using the cache line. Such a core is termed an inactive sharer and should ideally relinquish its entry to a core in need of it. A private sharer becomes inactive on an invalidation or an eviction. A remote sharer becomes inactive on a write by another core. If a replacement candidate exists, its entry is allocated to the requesting core. The initial mode of the core is obtained by taking a majority vote of the modes of the tracked cores. This is done so as to start off the requester in its most probable mode. Finally, if no replacement candidate exists, the mode for the requesting core is obtained by taking a majority vote of the modes of all the tracked cores. The limited locality list is left unchanged.

The storage overhead for the Limited_k classifier is directly proportional to the number of cores (k) for which locality information is tracked. In Section 4.11.4, we will evaluate the accuracy of the Limited_k classifier. Based on our observations, the Limited_3 classifier produces results that closely match and sometimes exceed those produced by the Complete classifier.

4.5 Selection of PCT

The Private Caching Threshold (PCT) is a parameter to our protocol; combining it with the observed spatio-temporal locality of cache lines, our protocol classifies data as private or remote. The extent to which our protocol improves the performance and energy consumption of the system is a complex function of the application characteristics, the most important being its working set and data sharing and access patterns. In Section 4.11, we will describe how these factors influence performance and energy consumption as PCT is varied for the evaluated benchmarks. We will also show that choosing a static PCT of 4 for the simulated benchmarks meets our performance and energy consumption improvement goals.
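As an illustration of the Limited_k classifier's bookkeeping described in Section 4.4, the sketch below shows entry allocation, inactive-sharer replacement, and the majority vote used to seed the mode of untracked cores. The field names, the vector-based bookkeeping and the tie-breaking choice are illustrative assumptions; the hardware keeps this state inside the directory entry itself.

    #include <cstdint>
    #include <vector>

    enum class Mode : uint8_t { Private, Remote };

    struct TrackedCore {
        int      coreId = -1;            // -1 denotes a free (INVALID) entry
        Mode     mode   = Mode::Private; // cores start out as private sharers
        uint8_t  remoteUtilization = 0;
        uint8_t  ratLevel = 0;
        bool     active = true;          // false once invalidated/evicted (private sharer)
                                         // or written by another core (remote sharer)
    };

    // Limited_k locality classifier attached to one directory entry.
    struct LimitedClassifier {
        std::vector<TrackedCore> entries;   // size k (k = 3 in the evaluation)

        explicit LimitedClassifier(unsigned k) : entries(k) {}

        // Majority vote over the modes of currently tracked cores.
        // Ties and an empty list default to Private here (an assumption).
        Mode majorityMode() const {
            int priv = 0, rem = 0;
            for (const auto& e : entries) {
                if (e.coreId < 0) continue;
                if (e.mode == Mode::Private) ++priv; else ++rem;
            }
            return (rem > priv) ? Mode::Remote : Mode::Private;
        }

        // Returns the tracked entry for 'coreId', allocating a free entry or
        // replacing an inactive sharer if possible; returns nullptr if the core
        // cannot be tracked (the caller then uses majorityMode()).
        TrackedCore* lookupOrAllocate(int coreId) {
            for (auto& e : entries)
                if (e.coreId == coreId) return &e;      // already tracked
            for (auto& e : entries)
                if (e.coreId < 0) {                     // free entry: start as Private,
                    e = TrackedCore{};                  // matching the protocol's
                    e.coreId = coreId;                  // initial classification
                    return &e;
                }
            for (auto& e : entries)
                if (!e.active) {                        // replace an inactive sharer
                    Mode seed = majorityMode();
                    e = TrackedCore{};
                    e.coreId = coreId;
                    e.mode = seed;                      // start in most probable mode
                    return &e;
                }
            return nullptr;                             // untracked core
        }
    };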
4.6 Overheads of the Locality-Based Protocol

4.6.1 Storage

The locality-aware protocol requires extra bits at the directory and private caches to track locality information. At the private L1 cache, tracking locality requires 2 bits for the private utilization counter per cache line (assuming an optimal PCT of 4). At the directory, the Limited_3 classifier tracks locality information for three sharers. Tracking one sharer requires 4 bits to store the remote utilization counter (assuming an RAT_max of 16), 1 bit to store the mode, 1 bit to store the RAT-level (assuming 2 RAT levels) and 6 bits to store the core ID (for a 64-core processor). Hence, the Limited_3 classifier requires an additional 36 (= 3 x 12) bits of information per directory entry. The Complete classifier, on the other hand, requires 384 (= 64 x 6) bits of information. The assumptions stated here will be justified in the evaluation section.

The following calculations are for one core, but they are applicable for the entire system since all cores are identical. The sizes for the per-core L1-I, L1-D and L2 caches used in our system are shown in Table 4.1. The directory is integrated with the L2 cache, so each L2 cache line has an associated directory entry.

L1-I Cache     16 KB
L1-D Cache     32 KB
L2 Cache       256 KB
Total          304 KB
Line Size      64 bytes

Table 4.1: Cache sizes per core.

The storage overhead in the L1-I and L1-D caches is (2/512) x (16 + 32) KB = 0.19 KB (2 extra bits per 512-bit cache line). We neglect this in future calculations since it is really small. The storage overhead in the directory for the Limited_3 classifier is (36/512) x 256 KB = 18 KB. For the Complete classifier, it is 192 KB. Now, the storage required for the ACKwise_4 protocol in this processor is 12 KB (assuming 24 bits per directory entry) and that for the Full-Map protocol is 32 KB. Adding up all the storage components, the Limited_3 classifier with the ACKwise_4 protocol uses less storage than the Full-Map protocol and 5.7% more storage than the baseline ACKwise_4 protocol (factoring in the L1-I, L1-D and L2 cache sizes also). The Complete classifier with ACKwise_4 uses 60% more storage than the baseline ACKwise_4 protocol. Table 4.2 shows the storage overheads of the protocols of interest.

                            Directory    Overall
Full-Map                    32 KB        336 KB
ACKwise_4                   12 KB        316 KB
ACKwise_4 + Complete        236 KB       540 KB
ACKwise_4 + Limited_3       32 KB        336 KB

Table 4.2: Storage required for the caches and coherence protocol per core.

4.6.2 Cache & Directory Accesses

Updating the private utilization counter in a cache requires a read-modify-write operation on every cache hit. This is true even if the cache access is a read. However, the utilization counter, being just 2 bits in length, can be stored in the tag array. Since the tag array already needs to be written on every cache hit to update the replacement policy (e.g., LRU) counters, our protocol does not incur any additional cache accesses. On the directory side, lookup/update of the locality information can be performed during the same operation as the lookup/update of the sharer list for a particular cache line. However, the lookup/update of the directory entry is now more expensive since it includes both the sharer list and the locality information. This additional expense has been accounted for in the evaluation.

4.6.3 Network Traffic

The locality-aware protocol could create network traffic overhead due to the following three reasons:

1. The private utilization counter has to be sent along with the acknowledgement to the directory on an invalidation or an eviction.
2. In addition to the cache line address, the cache line offset and the memory access length have to be communicated during every cache miss. This is because the requester does not know whether it is a private or remote sharer (only the directory maintains this information, as explained previously).

3. The data word(s) to be written have to be communicated on every cache miss due to the same reason.

Some of these overheads can be hidden while others are accounted for during evaluation, as described below.

1. Sending back the utilization counter can be accomplished without creating additional network flits. For a 48-bit physical address (standard for most 64-bit x86 processors) and 64-bit flit size, an invalidation message requires 42 bits for the physical cache line address, 12 bits for the sender and receiver core IDs and 2 bits for the utilization counter. The remaining 8 bits suffice for storing the message type.

2. The cache line offset needs to be communicated but not the memory access length. We profiled the memory access lengths for the benchmarks evaluated and found them to be 64 bits in the common case. (Note that 64 bits is the processor word size.) Memory accesses that are < 64 bits in length are rounded up to 64 bits while those > 64 bits always fetch an entire cache line. Only 1 bit is needed to indicate this difference. Hence, a request to the directory on a cache miss uses 48 bits for the physical address (including offset), and 12 bits for the sender and receiver core IDs. The remaining 4 bits suffice for storing the message type.

3. The data word to be written (64 bits in length) is always communicated to the directory on a write miss in the L1 cache. This overhead is accounted for in our evaluation.

4.6.4 Transitioning between Private/Remote Modes

In our protocol, no additional coherence-protocol messages are invoked when a sharer transitions between private and remote modes.

4.7 Simpler One-Way Transition Protocol

The complexity of the above protocol could be decreased if cores, once classified as remote sharers w.r.t. a cache line, stayed in the same mode throughout the program. If this were true, the storage required to track locality information at the directory could be avoided except for the mode bits. The bits to track private utilization at the cache tags would still be required to demote a core to the status of a remote sharer. We term this simpler protocol Adapt_1-way, and in Section 4.11.5, we observe that this protocol is worse than the original protocol by 34% in completion time and 13% in energy. Hence, a protocol that incorporates dynamic transitioning between both modes is required for efficient operation.

4.8 Synergy with ACKwise

The locality-aware coherence protocol reduces the number of private sharers of low-locality cache lines and increases the number of remote sharers. This is beneficial for the ACKwise protocol since it reduces the number of invalidations as well as the number of overflows of the limited directory. Section 4.11.6 evaluates this synergy.

4.9 Potential Advantages of Locality-Aware Cache Coherence

The locality-aware coherence protocol has the following key advantages over conventional private-caching protocols.

1. By allocating cache lines only for high-locality sharers, the protocol prevents the pollution of caches with low-locality data and makes better use of their capacity.

2. It reduces overall system energy consumption by reducing the amount of network traffic and cache accesses.
The network traffic is reduced by removing invalidation, flush and eviction traffic for low-locality data as well as by returning/storing only a word instead of an entire cache line.

3. Removing invalidation and flush messages and returning a word instead of a cache line also improves the average memory latency.

4.10 Evaluation Methodology

We evaluate a 64-core multicore using in-order cores. The default architectural parameters used for evaluation are shown in Table 2.1. The parameters specific to the locality-aware cache coherence protocol are shown in Table 4.3.

Architectural Parameter            Value
Private Caching Threshold          PCT = 4
Max Remote Access Threshold        RAT_max = 16
Number of RAT Levels               nRAT_levels = 2
Classifier                         Limited_3

Table 4.3: Locality-aware cache coherence protocol parameters

4.10.1 Evaluation Metrics

Each multithreaded benchmark is run to completion using the input sets from Table 2.3. We measure the energy consumption of the memory system, including the on-chip caches and the network. For each simulation run, we measure the Completion Time, i.e., the time spent in the parallel region of the benchmark; this includes the compute latency, the memory access latency, and the synchronization latency. The memory access latency is further broken down into four components.

1. L1-to-L2 cache latency is the time spent by the L1 cache miss request to the L2 cache and the corresponding reply from the L2 cache, including time spent in the network and the first access to the L2 cache.

2. L2 cache waiting time is the queueing delay incurred because requests to the same cache line must be serialized to ensure memory consistency.

3. L2 cache to sharers latency is the round-trip time needed to invalidate private sharers and receive their acknowledgments. This also includes time spent requesting and receiving synchronous write-backs.

4. L2 cache to off-chip memory latency is the time spent accessing memory, including the time spent communicating with the memory controller and the queueing delay incurred due to finite off-chip bandwidth.

We also measure the energy consumption of the memory system, which includes the on-chip caches and the network. Our goal with the ACKwise and the Locality-aware Private Cache Replication protocols is to minimize both Completion Time and Energy Consumption.

One of the important memory system metrics we track to evaluate our protocol is the breakdown of cache miss types. They are as follows:

1. Cold misses are cache misses that occur to a line that has never been previously brought into the cache.

2. Capacity misses are cache misses to a line that was brought in previously but later evicted to make room for another line.

3. Upgrade misses are cache misses to a line in read-only state when an exclusive request is made for it.

4. Sharing misses are cache misses to a line that was brought in previously but was invalidated or downgraded due to a read/write request by another core.

5. Word misses are cache misses to a line that was remotely accessed previously.

4.11 Results

The architectural parameters from Table 2.1 are used for the study unless otherwise stated. In Section 4.11.1, a sweep study is performed to understand the trends in Energy and Completion Time for the evaluated benchmarks as PCT (Private Caching Threshold) is varied. In Section 4.11.3, the approximation scheme for the Timestamp-based classification is evaluated and the optimal number of remote access threshold levels (nRAT_levels) and maximum RAT threshold (RAT_max) is determined.
Next, in Section 4.11.4, the accuracy of the limited locality tracking classifier (Limited_k) is evaluated by performing a sensitivity analysis on k. Section 4.11.5 compares the Energy and Completion Time of the locality-aware protocol against the simpler one-way transition protocol (Adapt_1-way). The best value of PCT obtained in Section 4.11.1 is used for the experiments in Sections 4.11.3, 4.11.4, and 4.11.5. Section 4.11.6 evaluates the synergy between the ACKwise and the locality-aware coherence protocol.

4.11.1 Energy and Completion Time Trends

Figures 4-8 and 4-9 plot the energy and completion time of the evaluated benchmarks as a function of PCT. Results are normalized in both cases to a PCT of 1, which corresponds to the baseline R-NUCA system with the ACKwise_4 directory protocol.

Figure 4-8: Variation of Energy with PCT. Results are normalized to a PCT of 1. Note that Average and not Geometric-Mean is plotted here.

Figure 4-9: Variation of Completion Time with PCT. Results are normalized to a PCT of 1. Note that Average and not Geometric-Mean is plotted here.

Both energy and completion time decrease initially as PCT is increased. As PCT increases to higher values, both completion time and energy start to increase.

Energy

We consider the impact that our protocol has on the energy consumption of the memory system (L1-I cache, L1-D cache, L2 cache and directory) and the interconnection network (both router and link). The distribution of energy between the caches and network varies across benchmarks and is primarily dependent on the L1-D cache miss rate. For this purpose, the L1-D cache miss rate is plotted along with miss type breakdowns in Figure 4-10.

Figure 4-10: L1 Data Cache Miss Rate and Miss Type Breakdown vs PCT. Note that in this graph, the miss rate increases from left to right as well as from top to bottom.

Benchmarks such as WATER-SP and SUSAN with low cache miss rates (~0.2%) dissipate 95% of their energy in the L1 caches, while those such as CONCOMP and LU-NC with higher cache miss rates dissipate more than half of their energy in the network.
The energy consumption of the L2 cache compared to the L1-I and L1-D caches is also highly dependent on the L1-D cache miss rate. For example, WATER-SP has negligible L2 cache energy consumption, while OCEAN-NC's L2 energy consumption is more than its combined L1-I and L1-D energy consumption.

At the 11 nm technology node, network links have a higher contribution to the energy consumption than network routers. This can be attributed to the poor scaling trends of wires compared to transistors. As shown in Figure 4-8, this trend is observed in all our evaluated benchmarks. The energy consumption of the directory is negligible compared to all other sources of energy consumption. This motivated our decision to put the directory in the L2 cache tag arrays as described earlier. The additional bits required to track locality information at the directory have a negligible effect on energy consumption.

Varying PCT impacts energy by changing both network traffic and cache accesses. In particular, increasing the value of PCT decreases the number of private sharers of a cache line and increases the number of remote sharers. This impacts the network traffic and cache accesses in the following three ways. (1) Fetching an entire line on a cache miss in conventional coherence protocols is replaced by multiple word accesses to the shared L2 cache. Note that each word access at the shared L2 cache requires a lookup and an update to the utilization counter in the directory as well. (2) Reducing the number of private sharers decreases the number of invalidations (and acknowledgments) required to keep all cached copies of a line coherent. Synchronous write-back requests that are needed to fetch the most recent copy of a line are reduced as well. (3) Since the caching of low-locality data is eliminated, the L1 cache space can be more effectively used for high-locality data, thereby decreasing the amount of asynchronous evictions (leading to capacity misses) for such data.

Benchmarks that yield a significant improvement in energy consumption do so by converting either capacity misses (in BODYTRACK and BLACKSCHOLES) or sharing misses (in DIJKSTRA-SS and STREAMCLUSTER) into cheaper word misses. This can be observed from Figure 4-8 when going from a PCT of 1 to 2 in BODYTRACK and BLACKSCHOLES and a PCT of 2 to 3 in DIJKSTRA-SS and STREAMCLUSTER. While a sharing miss is more expensive than a capacity miss due to the additional network traffic generated by invalidations and synchronous write-backs, turning capacity misses into word misses improves cache utilization (and reduces cache pollution) by reducing evictions and thereby capacity misses for other cache lines. This is evident in benchmarks like BLACKSCHOLES, BODYTRACK, DIJKSTRA-AP and MATMUL, in which the cache miss rate drops when switching from a PCT of 1 to 2. Benchmarks like LU-NC and PATRICIA provide energy benefit by converting both capacity and sharing misses into word misses. At a PCT of 4, the geometric mean of the energy consumption across all benchmarks is less than that at a PCT of 1 by 25%.

Completion Time

As shown in Figure 4-9, our protocol reduces the completion time as well. Noticeable improvements (>5%) occur in 11 out of the 21 evaluated benchmarks. Most of the improvements occur for the same reasons as discussed for energy, and can be attributed to our protocol identifying low-locality cache lines and converting the capacity and sharing misses on them to cheaper word misses.
Benchmarks such as BLACKSCHOLES, DIJKSTRA-AP and MATMUL experience a lower miss rate when PCT is increased from 1 to 2 due to better cache utilization. This translates into a lower completion time. In CONCOMP, cache utilization does not improve but capacity misses are converted into almost an equal number of word misses. Hence, the completion time improves.

Benchmarks such as STREAMCLUSTER and TSP show completion time improvement due to converting expensive sharing misses into word misses. From a performance standpoint, sharing misses are expensive because they increase: (1) the L2 cache to sharers latency and (2) the L2 cache waiting time. Note that the L2 cache waiting time of one core may depend on the L2 cache to sharers latency of another, since requests to the same cache line need to be serialized. In these benchmarks, even if the cache miss rate increases with PCT, the miss penalty is lower because a word miss is much cheaper than a sharing miss. A word miss does not contribute to the L2 cache to sharers latency and only contributes marginally to the L2 cache waiting time. Hence, the above two memory access latency components can be significantly reduced. Reducing these components may decrease synchronization time as well if the responsible memory accesses lie within the critical section. STREAMCLUSTER and DIJKSTRA-SS mostly reduce the L2 cache waiting time while PATRICIA and TSP reduce the L2 cache to sharers latency.

In a few benchmarks such as LU-NC and BARNES, completion time is found to increase after a PCT of 3 because the added number of word misses overwhelms any improvement obtained by reducing capacity misses. At a PCT of 4, the geometric mean of the completion time across all benchmarks is less than that at a PCT of 1 by 15%.

Figure 4-11: Variation of Geometric-Means of Completion Time and Energy with Private Caching Threshold (PCT). Results are normalized to a PCT of 1.

4.11.2 Static Selection of PCT

To put everything in perspective, we plot the geometric means of the Completion Time and Energy for our benchmarks in Figure 4-11. We observe a gradual decrease of completion time up to a PCT of 3, constant completion time till a PCT of 4, and an increase in completion time afterward. Energy consumption decreases up to a PCT of 5, then stays constant till a PCT of 8, and after that it starts increasing. We conclude that a PCT of 4 meets our goal of simultaneously improving both completion time and energy consumption. A completion time reduction of 15% and an energy consumption improvement of 25% are obtained when moving from a PCT of 1 to 4.

4.11.3 Tuning Remote Access Thresholds

Figure 4-12: Remote Access Threshold sensitivity study for nRAT_levels (L) and RAT_max (T). The configurations evaluated are Timestamp, L-1, L-2/T-8, L-2/T-16, L-4/T-8, L-4/T-16 and L-8/T-16.

As explained in Section 4.3, the Timestamp-based classification scheme was expensive due to its area overhead, both at the L1 cache and directory. The benefits provided by this scheme can be approximated by having multiple Remote Access Threshold (RAT) levels and dynamically switching between them at runtime to counteract the increased L1 cache pressure. We now perform a study to determine the optimal number of threshold levels (nRAT_levels) and the maximum threshold (RAT_max). Figure 4-12 plots the completion time and energy consumption for the different points of interest.
The results are normalized to that of the Timestamp-based classification scheme. The completion time is almost constant throughout. However, the energy consumption is nearly 9% higher when nRAT_levels = 1. With multiple RAT levels (nRAT_levels > 1), the energy is significantly reduced. Also, the energy consumption with RAT_max = 16 is found to be slightly lower (2%) than with RAT_max = 8. With RAT_max = 16, there is almost no difference between nRAT_levels = 2, 4, 8, so we choose nRAT_levels = 2 since it minimizes the area overhead.

4.11.4 Limited Locality Tracking

As explained in Section 4.4, tracking the locality information for all the cores in the directory results in an area overhead of 60% per core. So, we explore a mechanism that tracks the locality information for only a few cores, and classifies a new core as a private or remote sharer based on a majority vote of the modes of the tracked cores.

Figure 4-13 plots the completion time and energy of the benchmarks with the Limited_k classifier when k is varied as (1, 3, 5, 7, 64). k = 64 corresponds to the Complete classifier. The results are normalized to that of the Complete classifier. The benchmarks that are not shown are identical to WATER-SP, i.e., the completion time and energy stay constant as k varies. The experiments are run with the best static PCT value of 4 obtained in Section 4.11.2. We observe that the completion time and energy consumption of the Limited_3 classifier never exceed those of the Complete classifier by more than 3%. In STREAMCLUSTER and DIJKSTRA-SS, the Limited_3 classifier does better than the Complete classifier because it learns the mode of sharers more quickly.

Figure 4-13: Variation of Completion Time and Energy with the number of hardware locality counters (k) in the Limited_k classifier. Limited_64 is identical to the Complete classifier. Benchmarks for which results are not shown are identical to WATER-SP, i.e., the Completion Time and Energy stay constant as k varies.

While the Complete classifier starts off each sharer of a cache line independently in private mode, the Limited_3 classifier infers the mode of a new sharer from the modes of existing sharers. This enables the Limited_3 classifier to put the new sharer in remote mode without the initial per-sharer classification phase. We note that the Complete locality classifier can also be equipped with such a learning short-cut.

Inferring the modes of new sharers from the modes of existing sharers can be harmful too, as is illustrated in the case of RADIX and BODYTRACK for the Limited_1 classifier. The cache miss rate breakdowns for these benchmarks as the number of locality counters is varied are shown in Figure 4-14. While RADIX starts off new sharers incorrectly in remote mode, BODYTRACK starts them off incorrectly in private mode. This is because the first sharer is classified as remote (in RADIX) and private (in BODYTRACK).
This causes other sharers to also reside in that mode while they actually want to be in the opposite mode. Our observation from the above sensitivity experiment is that tracking the locality information for three sharers suffices to offset such incorrect classifications.

Figure 4-14: Cache miss rate breakdown variation with the number of hardware locality counters (k) in the Limited_k classifier. Limited_64 is identical to the Complete classifier. (a) RADIX, (b) BODYTRACK.

4.11.5 Simpler One-Way Transition Protocol

Figure 4-15: Ratio of Completion Time and Energy of Adapt_1-way over Adapt_2-way.

In order to quantify the efficacy of the dynamic nature of our protocol, we compare the protocol to a simpler version having only one-way transitions (Adapt_1-way). The simpler version starts off all cores as private sharers and demotes them to remote sharers when the utilization is less than the private caching threshold (PCT). However, these cores then stay as remote sharers throughout the lifetime of the program and can never be promoted. The experiment is run with the best PCT value of 4.

Figure 4-15 plots the ratio of completion time and energy of the Adapt_1-way protocol over our protocol (which we term Adapt_2-way). The higher the ratio, the higher is the need for two-way transitions. We observe that the Adapt_1-way protocol is worse in completion time and energy by 34% and 13% respectively. In benchmarks such as BODYTRACK and DIJKSTRA-SS, the completion time ratio is worse by 3.3x and 2.3x respectively.

4.11.6 Synergy with ACKwise

Figure 4-16: Synergy between the locality-aware coherence protocol and ACKwise. Variation of average and maximum sharer count during invalidations as a function of PCT.

Figure 4-16 evaluates the synergy between the locality-aware private cache replication scheme and ACKwise. It plots the average and maximum sharer count during invalidations as a function of PCT. As PCT increases, both the above sharer counts decrease since low-locality data is cached remotely. At a PCT of 4, the average sharer count in most benchmarks is reduced from 2-2.5 to 0.5-1. This reduces the number of invalidations and the number of overflows of the limited directory.

4.12 Summary

This chapter introduced a locality-aware private cache replication protocol to improve on-chip memory access latency and energy efficiency in large-scale multicores. This protocol is motivated by the observation that cache lines exhibit varying degrees of reuse (i.e., variable spatio-temporal locality) at the private cache levels. A cache-line-level classifier is introduced to distinguish between low- and high-reuse cache lines. A traditional cache coherence scheme that replicates data in the private caches is employed for high-reuse data. Low-reuse data is handled efficiently using a remote access [31] mechanism. Remote access does not allocate data in the private cache levels.
Instead, it allocates only a single copy in a particular core's shared cache slice and directs load/store requests made by all cores towards it. Data access is performed at the word level and requires a round-trip message between the requesting core and the remote cache slice. This improves the utilization of private cache resources by removing unnecessary data replication. In addition, it reduces network traffic by transferring only those words in a cache line that are accessed on-demand. Consequently, unnecessary invalidations and write-back requests are removed, which reduces network traffic even further. The locality-aware private cache replication protocol preserves the familiar programming paradigm of shared memory while using remote accesses for data access efficiency.

The protocol has been evaluated for the Sequential Consistency (SC) memory model using in-order cores with a single outstanding memory transaction per core. Evaluation on a 64-core multicore shows that the protocol reduces the overall Energy Consumption by 25% while improving the Completion Time by 15%. The protocol can be implemented with only 18 KB storage overhead per core when compared to the ACKwise_4 limited directory protocol, and has a lower storage overhead than a full-map directory protocol.

Chapter 5

Timestamp-based Memory Ordering Validation

State-of-the-art multicore processors have to balance ease of programming with good performance and energy efficiency. The programming complexity is significantly affected by the memory consistency model of the processor. The memory model dictates the order in which the memory operations of one thread appear to another. The strongest memory model is the Sequential Consistency (SC) [59] model. SC mandates that the global memory order is an interleaving of the memory accesses of each thread, with each thread's memory accesses appearing in program order in this global order. SC is the most intuitive model to the software developer and is the easiest to program and debug with.

Production processors do not implement SC due to its negative performance impact. SPARC RMO [1], ARM [36] and IBM Power [65] processors implement relaxed (weaker) memory models that allow reordering of load & store instructions with explicit fences for ordering when needed. These processors can better exploit memory-level parallelism (MLP), but require careful programmer-directed insertion of memory fences to do so. Automated fence insertion techniques sacrifice performance for programmability [7].

Intel x86 [83] and SPARC [1] processors implement Total Store Order (TSO), which attempts to strike a balance between programmability and performance. The TSO model only relaxes the Store→Load ordering of SC, and improves performance by enabling loads (that are crucial to performance) to bypass stores in the write buffer. Note that fences may still be needed in critical sections of code where the Store→Load ordering is required.

Implementing the TSO model on out-of-order multicore processors in a straightforward manner sacrifices memory-level parallelism. This is because loads have to wait for all previous load/fence operations to complete before being issued, while stores/fences have to wait for all previous load/store/fence operations. This inefficiency is circumvented in current processors by employing two optimizations [3]. (1) Load performance is improved using speculative out-of-order execution, enabling loads to be issued and completed before previous load/fence operations.
Memory consistency violations are detected when invalidations, updates or evictions are made to addresses in the load queue. The pipeline state is rolled back if this situation arises. (2) Store performance is improved using exclusive store prefetch [33] requests. These prefetch requests fetch the cache line into the L1-D cache and can be executed out-of-order and in parallel. The store requests, on the other hand, must be issued and completed in-order to preserve TSO. However, most store requests hit in the L1-D cache (due to the earlier prefetch request) and hence can be completed quickly. The performance of fences is automatically improved by optimizing previous load & store operations. Note that the above two optimizations can also be employed to improve the performance of processors under sequential consistency or memory models weaker than TSO.

5.1 Principal Contributions

This chapter explores how to extend the locality-aware private cache replication scheme to out-of-order speculative processors for popular memory models. Unfortunately, the remote (cache) access required by the locality-aware scheme is incompatible with the two optimizations described earlier. Since private cache copies of cache lines are not always maintained, invalidation/update requests cannot be used to detect memory consistency violations for speculatively executed load operations. In addition, an exclusive store prefetch request is not applicable since a remote access never caches data in the private cache.

In this chapter, we present a novel technique that uses timestamps to detect memory consistency violations when speculatively executing loads under the locality-aware protocol. Each load and store operation is assigned an associated timestamp and a simple arithmetic check is done at commit time to ensure that memory consistency has not been violated. This technique does not rely on invalidation/update requests and hence is applicable to remote accesses. The timestamp mechanism is efficient due to the observation that consistency violations occur due to conflicting accesses that have temporal proximity (i.e., within a few cycles of each other), thus requiring timestamps to be stored only for a small time window. This technique works completely in hardware and requires only 2.5 KB of storage per core. This scheme guarantees forward progress and is starvation-free.

An alternate technique is also implemented that is based on the observation that memory consistency violations only occur due to conflicting accesses to shared read-write data [84]. Accesses to private data and concurrent reads to shared read-only data cannot cause violations. Hence, a mechanism is designed that classifies data into 3 categories: (1) private, (2) shared read-only, and (3) shared read-write. Total Store Order (TSO) is enforced through serialization of load/store/fence accesses for the 3rd category, while memory accesses from the first two categories can be executed out-of-order. The classification is done using a hardware-software mechanism that takes advantage of existing TLB and page table structures. The implementation ensures precise interrupts and smooth transition between the categories. This technique can also be used to improve the energy efficiency of the timestamp-based technique, as will be discussed later.
Our evaluation using a 64-core multicore with out-of-order speculative cores shows that the timestamp-based technique, when implemented on top of the locality-aware cache coherence protocol [54], improves completion time by 16% and energy by 21% over a state-of-the-art cache management scheme (Reactive NUCA [39]).

The rest of this chapter is organized as follows. Section 5.2 provides background information about out-of-order core models. Section 5.3 provides background information about formal methods to specify memory models and will be used in proving the correctness of the timestamp-based validation technique presented in this chapter. Section 5.4 presents the timestamp-based technique to detect memory consistency violations. This section also introduces a few variations of the technique with different hardware complexity and performance trade-offs. Section 5.5 discusses how to implement the timestamp-based technique on other memory models and a couple of design issues. Section 5.6 presents an alternate hardware-software co-design technique that exploits the observation that memory consistency is only violated by conflicting accesses to shared read-write data. Section 5.7 describes the experimental methodology. Section 5.8 presents the results. And finally, Section 5.9 summarizes the chapter.

5.2 Background: Out-of-Order Processors

To facilitate describing the mechanisms, we briefly outline the backend of a modern out-of-order processor. The backend consists of the Physical Register File (PRF), Reorder Buffer (ROB), Reservation Stations (RS) for each functional unit, Speculative Load Queue (LQ) and Store Queue (SQ).

1. Physical Register File (PRF) contains the values of all registers and is co-located with the functional units.

2. Register Alias Table (RAT) contains the mapping from the architectural register file (ARF) to the physical register file (PRF).

3. Reorder Buffer (ROB) ensures that instructions (/micro-ops) are committed in program order. Each micro-op is dispatched to and retired from the re-order buffer in program order. Program order commit is required to recover precise state in the event of interrupts, exceptions and mis-speculations.

4. Reservation Station (RS) implements the out-of-order logic. It holds the micro-ops before they are ready to be issued to each functional unit. Each functional unit has a dedicated reservation station. Micro-ops are dispatched to the RS in program order but can be retired out-of-order as soon as their operands are ready.

5. Load-Store Unit handles the memory load and store operations in compliance with the memory consistency model of the processor.

6. Execution Unit performs the integer and floating point operations.

When the result of a functional unit (execution unit / load-store unit) is ready, the physical register ID of the result register is broadcast on the common data bus (CDB). Each reservation station compares this result register ID against the operand IDs of the micro-ops that are not yet ready. It marks the ready micro-ops so that they can be issued to the relevant functional unit in the next clock cycle. The result value of a functional unit is directly written into the physical register file. Likewise, when a micro-op is issued to a functional unit, the operands are read directly from the physical register file. Using a PRF in this manner reduces the data movement within a core and removes the need to store register values in the ROB and RS. Only register IDs are stored within the ROB and RS.
However, since writes are done to the PRF out of program order, special hardware support is needed to maintain precise register state.

5.2.1 Load-Store Unit

The load-store unit performs the load and store operations in compliance with the memory consistency model of the processor. It contains the following components.

1. Address Unit computes the virtual address (given the register read operands).

2. Memory Management Unit (MMU) translates the virtual address into a physical address using the Translation Lookaside Buffer (TLB) and page table hierarchy.

3. Store Queue holds the store addresses and values to be written to memory in program order.

4. Load Queue serves to enforce the memory consistency model of the processor by detecting load speculation violations.

We first define a few terms to facilitate describing the techniques.

- DispatchTime: Time at which micro-ops are allocated an entry in the re-order buffer and reservation station. All memory operations are allocated an entry in the load or store queues. Load and store operations are dispatched in program order.

- IssueTime: Time at which arithmetic/memory operations are issued from the reservation station to the execution units/cache subsystem. For a memory operation to be issued, the operands have to be ready.

- CompletionTime: For a load operation, it is the time at which the load receives the data from the cache subsystem. For a store operation, it is the time at which the store receives an acknowledgement after its value has been written to the cache subsystem and propagated to all the cores in the system.

- CommitTime: Time at which a memory operation is committed. Operations are committed in program order for precise state during exceptions and interrupts. A load is committed after it is complete and all previous operations have been committed. A store is committed as soon as its address and data operands are ready and virtual-to-physical address translation is complete.

5.2.2 Lifetime of Load/Store Operations

The lifetime of a load operation is as follows. A load is first dispatched to the reorder buffer/reservation station in program order. Once the operands are ready, the physical address is computed (using the address calculation and the virtual-to-physical translation mechanisms), and the load is simultaneously issued to the load queue and cache subsystem. The load completes when the data is returned. And the load is committed in program order after it completes.

The lifetime of a store operation is as follows. A store is first dispatched to the reorder buffer/reservation station in program order. Once the address & data operands are ready, the physical address is computed, an exclusive store prefetch is issued for the address and the store request (both physical address & data) is added to the store queue. The store is committed in program order once the physical address is ready. Later, the store is issued to the cache subsystem after all previous stores are completed. Note that the current store can be issued only after it commits to ensure precise exceptions. The store is completed after the data is written to the cache and an acknowledgement is received.

5.2.3 Precise State

Maintaining precise state is required to recover from interrupts, exceptions or mis-speculations. Exceptions are caused due to illegal instruction opcodes, divide-by-zero operations, and memory faults (TLB misses, page faults, segmentation faults, etc.).
Mis-speculations are caused due to branch mispredictions and memory load ordering violations. The processor maintains two methods to recover precise state. The fast but costly method involves taking snapshots of the Register Alias Table on every instruction that could cause pipeline rollback. To recover precise state, the snapshot is simply restored. This method is commonly used for branch and load ordering mis-speculations. The slow but cheap method involves rolling back each architectural register's physical register translation starting with the latest dispatched entry backwards to the entry that initiated the pipeline rollback. This method is commonly used for exceptions and interrupts.

5.3 Background: Memory Models Specification

In order to study which processor optimizations violate or preserve a particular memory consistency model, a formal specification of memory ordering rules is required. In this section, two methods of specification are discussed, the operational and the axiomatic specification. We will prove the correctness of our timestamp-based validation technique using the axiomatic specification approach.

5.3.1 Operational Specification

The operational specification presents an abstract memory model specification using hardware structures such as store buffers. This formal specification is close to an actual hardware implementation and serves as a guide for hardware designers.

Figure 5-1: Operational Specification of the TSO Memory Consistency Model.

For example, the total store order (TSO) memory model can be specified using three types of components that run concurrently. A processor component (Pr) produces a sequence of memory operations, a memory location component (M) keeps track of the latest value written to a specific memory location, and a store buffer component (WQ) implements a FIFO buffer for write messages and supports read forwarding. These components are connected in the configuration described in Figure 5-1.

5.3.2 Axiomatic Specification

The axiomatic specification is programmer-centric and does not contain specifications about store buffers. This specification lists a set of axioms that define which execution traces are allowed by the model and in particular, which writes can be observed by each read. An execution trace is a sequence of memory operations (Read, Write, Fence) produced by a program. Each operation in the trace includes an identifier of the thread that produced that operation, and the address and value of the operation for reads and writes.

Axiomatic specifications usually refer to the program order, <p. For two operations x and y, x <p y if both x and y belong to the same thread and x precedes y in the execution trace. The program order, however, is not the order in which the memory operations are observed by the main memory. The memory order, <m, is a total order that indicates the order in which memory operations affect the main memory. A read observes the latest write to the same memory address according to <m. However, a global memory order can be constructed only for those models that have a "store atomicity" property, i.e., all threads observe writes in the same order. Examples of store atomic memory models are sequential consistency (SC) and total store order (TSO). We only define memory models having the store atomicity property. The total store order (TSO) model can be defined using two axioms: the Read-Values axiom and the Ordering axiom.
The Read-Values axiom requires the definition of a function sees(x, y, <m). sees(x, y, <m) is true if y is a write and y <m x or y <p x. The second part of this definition allows for load forwarding from a store (S1) to a later load (L2) even though the load could appear earlier in the memory order (<m).

Read Values. Given a read x and a write y to the same address as x, x and y have the same value if sees(x, y, <m) and there is no other write z such that sees(x, z, <m) and y <m z. If, for a read x, there is no write y such that sees(x, y, <m), then the read value is 0.

Ordering. For every x and y, x <p y implies that x <m y, unless x is a write and y is a read.

5.4 Timestamp-based Consistency Validation

This section introduces a novel timestamp-based technique for exploiting memory-level parallelism (MLP) in processors with remote access based coherence protocols. This technique allows all load/store operations to be executed speculatively. The correctness of speculation is validated by associating timestamps with every memory transaction and performing a simple arithmetic check at commit time. This mechanism is efficient due to the observation that consistency violations occur due to conflicting accesses that have temporal proximity (i.e., are issued only a few cycles apart). This enables the timestamps to be narrow and stored only for a small time window. We describe the working of this technique under the popular TSO memory model. Later, in Section 5.5.1, we discuss how it can be extended to stronger (i.e., SC) or weaker memory models.

The timestamp-based technique is built up gradually through a sequence of steps. The microarchitecture changes needed are colored in Figure 5-2 and will be described when introduced. The implementation is first described for a pure remote access scheme (that always accesses the shared L2 cache). Later, we describe adaptations for the locality-aware protocol that combines both remote L2 and private L1 cache accesses according to the reuse characteristics of data.

Figure 5-2: Microarchitecture of a multicore tile. The orange colored modules are added to support the proposed modifications. (Legend: ROB: Reorder Buffer; SQ: Store Queue; LQ: Load Queue; TC: Timestamp Counter; TC-H: Timestamp - HRP Counter; DRQ: Directory Request Queue; L1LHQ/L1SHQ: L1 Cache Load/Store History Queues; L2LHQ/L2SHQ: L2 Cache Load/Store History Queues.)

5.4.1 Simple Implementation of TSO Ordering

We first introduce a straightforward method to ensure TSO ordering (without speculation). The TSO model can be implemented by enforcing the following two constraints: (1) Load operations wait (i.e., without being issued to the cache subsystem) till all previous loads and fences complete; (2) Store operations and fences wait till all previous loads, stores and fences complete. Note that the pipeline can continue executing as long as there are no unsatisfied dependencies (i.e., NO data hazard) and free entries in the load and store queues (NO structural hazard).

If all memory references are satisfied by the L1 cache, this scheme works quite well. However, a memory transaction that misses within the L1 cache (as is the case for remote transactions) takes several (~10-100) cycles to complete. During this time, the load and store buffers fill up quickly and stall the pipeline. For a processor with a fetch width of 2 instructions per cycle, and a memory reference every 3 instructions, a memory operation has to complete within ~1.5 cycles to match the throughput of the pipeline.
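As a concrete restatement of these two constraints, the sketch below gates the issue of each memory operation on the completion of older ones; the queue representation is a simplified stand-in for the actual load/store queue logic, not the hardware implementation.

#include <cstdint>
#include <vector>

// Sketch of the non-speculative TSO issue rules of Section 5.4.1 (illustrative only).
enum class OpType : uint8_t { Load, Store, Fence };

struct MemOp {
    OpType type;
    bool   completed = false;
};

// 'older' holds the not-yet-retired memory operations that precede 'op' in program order.
bool canIssue(const MemOp& op, const std::vector<const MemOp*>& older) {
    for (const MemOp* prev : older) {
        if (prev->completed) continue;
        // (1) A load waits only for incomplete older loads and fences.
        if (op.type == OpType::Load &&
            (prev->type == OpType::Load || prev->type == OpType::Fence))
            return false;
        // (2) A store or fence waits for *all* incomplete older memory operations.
        if (op.type != OpType::Load)
            return false;
    }
    return true;
}

Since every L1 miss (and every remote access) holds up the head of these queues for tens of cycles, the speculative scheme developed next is needed to recover the lost memory-level parallelism.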
5.4.2 Basic Timestamp Algorithm

Next, we formulate a mechanism to increase the performance of load operations (instead of making them wait as described above). This requires enabling loads to execute (speculatively) as soon as their address operand is ready, potentially out of program order and in parallel with other loads. This could violate the TSO memory consistency model, and hence, a validation step is required to ensure that memory ordering is preserved. Stores and fences, on the other hand, have to wait till all previous loads, stores and fences have been completed. To check whether the speculative execution of loads violates TSO ordering, a timestamp-based validation technique is proposed.

Timestamp Generation: Timestamps are generated using a per-core counter, the Timestamp Counter (TC), as shown in Figure 5-2. TC increments on every clock cycle. Assume for now that timestamps are of infinite width, i.e., they never rollover and all cores are in a single clock domain. We will remove the infinite width assumption in Section 5.4.4 and discuss the single clock domain assumption in Section 5.5.2. Timestamps are tracked at different points of time, e.g., during load issue, store completion, etc., and comparisons are performed on these timestamps to determine whether speculation has failed according to the algorithms discussed in the following subsections.

Microarchitecture Modifications: The following changes are made to the L2 cache, the load queue and the store queue to facilitate the tracking of timestamps.

L2 Cache: The shared L2 cache is augmented with an L2 Load History Queue (L2LHQ) and an L2 Store History Queue (L2SHQ) as shown in Figure 5-2. They track the times at which loads and stores have been performed at the L2 cache. The timestamp assigned to a load/store is the time at which the request arrives at the L2 cache. (Once a request arrives, all subsequent requests to the same cache line will be automatically ordered after it.) Each entry in the L2LHQ / L2SHQ has two attributes, Address and Timestamp. An entry is added to the L2LHQ or L2SHQ whenever a remote load or store arrives. Assume for now that the L2LHQ / L2SHQ are of infinite size. We will remove this requirement in Section 5.4.3.

Load Queue: Each load queue entry is augmented with the attributes shown in Figure 5-3. Only the Address, IssueTime and LastStoreTime fields are required for this algorithm.

Figure 5-3: Structure of a load queue entry.

The IssueTime field records the time at which the load was issued to the cache subsystem and the LastStoreTime is the most recent modification time of the cache line. A remote load obtains this timestamp from the L2SHQ (L2 Store History Queue). If there are multiple entries in the L2SHQ corresponding to the address, the most recent entry's timestamp is taken. If no entries are found corresponding to the address, there has not been a store so far and hence, the most recent timestamp is '0'. This is then relayed back to the core over the on-chip network along with the word that is read. The load queue also contains a per-core OrderingTime field which is used for detecting speculation violations as will be described later.

Figure 5-4: Structure of a store queue entry.

Store Queue: Each store queue entry is augmented with the attributes shown in Figure 5-4.
The Data field contains the word to be written. The LastAccessTime field records the most recent access time of the cache line at the L2 cache. A remote store obtains the most recent timestamps from the L2LHQ and L2SHQ respectively. The maximum of the two timestamps is computed to get the LastAccessTime. This is then communicated back to the core over the on-chip network along with the acknowledgement for the store. The store queue also contains a per-core StoreOrderingTime field which is used for detecting speculation violations.

Speculation Violation Detection: In order to check whether speculatively executed loads have succeeded in the TSO memory model, the IssueTime of a load must be greater than or equal to:

1. The LastStoreTime observed by previous load operations. This ensures that the current (speculatively executed) load can be placed after all previous load operations in the global memory order. In other words, the global memory order for any pair of load operations from the same thread respects program order, i.e., the Load→Load ordering requirement of TSO is met.

2. The LastAccessTime observed by previous store operations that are separated from the current load by a memory fence (MFENCE in x86). This ensures that the load can be placed after previous fences in the global memory order, i.e., the Fence→Load ordering requirement is met.

Algorithm 1 : Basic Timestamp Scheme
1: function COMMITLOAD()
2:   if IssueTime < OrderingTime then
3:     REPLAYINSTRUCTIONSFROMCURRLOAD()
4:     return
5:   if OrderingTime < LastStoreTime then
6:     OrderingTime ← LastStoreTime
7: function RETIRESTORE()
8:   if StoreOrderingTime < LastAccessTime then
9:     StoreOrderingTime ← LastAccessTime
10: function COMMITFENCE()
11:   if OrderingTime < StoreOrderingTime then
12:     OrderingTime ← StoreOrderingTime

The OrderingTime field in the load queue keeps track of the maximum of (1) the LastStoreTime observed by previously committed loads and (2) the LastAccessTime observed by previously retired stores that are separated from the current operation by a fence. This provides a convenient field to compare the issue time with when making speculation checks. The OrderingTime is updated and used by the functions presented in Algorithm 1.

The COMMITLOAD function is executed when a load is ready to be committed. This algorithm checks whether speculation has failed by comparing the IssueTime of a load against OrderingTime. If the IssueTime is greater, speculation has succeeded and the OrderingTime is updated using the LastStoreTime observed by the current load. Else, speculation has failed, and the instructions are replayed starting from the current load.

The RETIRESTORE function is executed when a store is retired from the store queue (i.e., after it completes execution). Note that the store could have been committed (possibly much earlier) as soon as its address calculation and translation are done. The StoreOrderingTime field in the store queue keeps track of the maximum of the LastAccessTime observed by previously retired stores and is updated when each store is retired.

The COMMITFENCE function is executed when a fence is ready to be committed. This function updates the OrderingTime using the StoreOrderingTime field and serves to maintain the Fence→Load ordering. On the other hand, if the IssueTime is less than the OrderingTime, a previous load operation could have seen a later store, possibly violating consistency. In such a scenario, the pipeline has to be rolled back and restarted with this offending load instruction.
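The sketch below restates Algorithm 1, together with the queue-entry fields it uses, in executable form; the field names follow Figures 5-3 and 5-4, the replay action is left to the caller, and timestamps are assumed to be of unbounded width (Section 5.4.4 removes that assumption).

#include <algorithm>
#include <cstdint>

using Timestamp = uint64_t;   // unbounded-width assumption of the basic algorithm

struct LoadQueueEntry {       // subset of Figure 5-3 used by Algorithm 1
    Timestamp issueTime = 0;      // when the load was issued to the cache subsystem
    Timestamp lastStoreTime = 0;  // returned by the L2SHQ along with the loaded word
};

struct StoreQueueEntry {      // subset of Figure 5-4
    Timestamp lastAccessTime = 0; // max of L2LHQ/L2SHQ times, returned with the store ack
};

struct BasicTimestampChecker {
    Timestamp orderingTime = 0;       // per-core field kept with the load queue
    Timestamp storeOrderingTime = 0;  // per-core field kept with the store queue

    // COMMITLOAD: returns false if speculation failed and the load must be replayed.
    bool commitLoad(const LoadQueueEntry& ld) {
        if (ld.issueTime < orderingTime) return false;
        orderingTime = std::max(orderingTime, ld.lastStoreTime);
        return true;
    }

    // RETIRESTORE: called when the store completes and leaves the store queue.
    void retireStore(const StoreQueueEntry& st) {
        storeOrderingTime = std::max(storeOrderingTime, st.lastAccessTime);
    }

    // COMMITFENCE: folds store history into the ordering check for later loads.
    void commitFence() {
        orderingTime = std::max(orderingTime, storeOrderingTime);
    }
};

Under SC, retireStore would additionally fold lastAccessTime directly into orderingTime, which corresponds to the implicit-fence variant described next.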
The above algorithms suffice for implementing the TSO memory model. If sequential consistency (SC) is to be implemented, all store operations are marked as being accompanied by an implicit memory fence. Hence, the IssueTime of a speculatively executed load must be greater than or equal to the maximum of the LastStoreTime and LastAccessTime observed by previous load and store operations, respectively.

Proof of Correctness

Here, a formal proof is presented on why the above timestamp-based validation preserves TSO memory ordering. The proof works by constructing a global order of memory operations that satisfies the Ordering axiom and the Read-Values axiom described earlier in Section 5.3. The global order is constructed by post-processing the memory trace obtained after executing the program. The memory trace consists of a 2-dimensional array of memory operations. The 1st dimension denotes a thread using its ID and the 2nd dimension lists the memory operations executed by that particular thread. Each memory operation (op) has 3 attributes:

1. Type: Denotes whether the operation is a Load, Store or Fence.

2. LastStoreTime: Wall-clock time at which the last modification was done to the cache line.

3. LastAccessTime: Wall-clock time at which the last access was made to the cache line.

Both LastStoreTime and LastAccessTime are used by the timestamp check in Algorithm 1 presented previously. The post-processing of the memory trace is done using Algorithm 2.

Algorithm 2 : Global Memory Order (<m) Construction - Basic Timestamp
1: function CONSTRUCTGLOBALMEMORYORDER(NumThreads, MemoryTrace)
2:   <m ← {}                                        ▷ Initialize the global memory order
3:   for t ← 1 to NumThreads do                     ▷ Iterate over all program threads
4:     OrderingTime ← 0
5:     StoreOrderingTime ← 0
6:     for each op ∈ MemoryTrace[t] do
7:       if op.Type = LOAD then
8:         OrderingTime ← max(OrderingTime, op.LastStoreTime) + δ
9:         <m[op] ← OrderingTime
10:      else if op.Type = STORE then
11:        StoreOrderingTime ← max(StoreOrderingTime, op.LastAccessTime) + δ
12:        OrderingTime ← OrderingTime + δ
13:        <m[op] ← max(StoreOrderingTime, OrderingTime)
14:      else if op.Type = FENCE then
15:        OrderingTime ← max(OrderingTime, StoreOrderingTime) + δ
16:        <m[op] ← OrderingTime
17:   return <m

In the outer loop, the algorithm iterates through all the program threads. In the inner loop, the memory operations (LOAD, STORE & FENCE) of each thread are iterated through in program order. The algorithm assigns a timestamp to each memory operation and inserts the operation into the global memory order <m in the ascending order of timestamps. The '<m[op] ←' statement ensures that each operation is inserted in ascending order. Here, δ is an infinitesimally small quantity that is added to ensure that a unique timestamp is assigned to all memory operations from the same thread. For all programs considered, the result obtained by summing δ over all memory operations is less than 1. When inserting in the global order <m, if multiple operations from different program threads are assigned the same timestamp, then the placement is done in the ascending order of thread IDs (multiple operations from the same thread cannot have the same timestamp). OrderingTime and StoreOrderingTime are updated based on op.LastStoreTime and op.LastAccessTime similar to the logic in Algorithm 1 presented earlier.
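A direct transcription of Algorithm 2 is sketched below for reference; the trace representation and the choice of delta are illustrative, with delta assumed small enough that its sum over all operations stays below one cycle, as required above.

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of Algorithm 2: assigning global-memory-order timestamps to a memory trace.
enum class OpType { Load, Store, Fence };

struct TraceOp {
    OpType type;
    double lastStoreTime = 0.0;   // wall-clock time of the last store to the line
    double lastAccessTime = 0.0;  // wall-clock time of the last access to the line
};

// Returns one timestamp per operation; sorting all operations by (timestamp, threadId)
// yields the global memory order <m.
std::vector<std::vector<double>>
constructGlobalMemoryOrder(const std::vector<std::vector<TraceOp>>& trace, double delta)
{
    std::vector<std::vector<double>> order(trace.size());
    for (std::size_t t = 0; t < trace.size(); ++t) {        // iterate over program threads
        double orderingTime = 0.0, storeOrderingTime = 0.0;
        for (const TraceOp& op : trace[t]) {                 // program order within a thread
            double ts = 0.0;
            switch (op.type) {
            case OpType::Load:
                orderingTime = std::max(orderingTime, op.lastStoreTime) + delta;
                ts = orderingTime;
                break;
            case OpType::Store:
                storeOrderingTime = std::max(storeOrderingTime, op.lastAccessTime) + delta;
                orderingTime += delta;
                ts = std::max(storeOrderingTime, orderingTime);
                break;
            case OpType::Fence:
                orderingTime = std::max(orderingTime, storeOrderingTime) + delta;
                ts = orderingTime;
                break;
            }
            order[t].push_back(ts);
        }
    }
    return order;
}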
Ordering Axiom: The ordering axiom is satisfied by the above assignment of timestamps due to the following reasons. If the timestamp assigned to a load or fence operation is T, then the timestamps assigned to all subsequent load, store and fence operations (from the same program thread) are at least T + δ. This ensures that the Load → op and Fence → op ordering requirements are met. (Here, op could refer to a Load, Store or Fence.) If the timestamp assigned to a store operation is T, then the timestamps assigned to all subsequent store and fence operations are at least T + δ. This ensures that the Store → Store and Store → Fence ordering requirements are met.

Read-Values Axiom: The read-values axiom is satisfied because the above assignment of timestamps preserves the order of load & store operations to the same cache line. For example, let the time at which a load arrives at the L2 cache be denoted as LoadTime. LastStoreTime indicates the time at which the last store to the same cache line arrived at the L2 cache. By definition, LoadTime > LastStoreTime. Since speculation has succeeded, the timestamp check performed at commit time ensures that LoadTime > OrderingTime. (Note that the speculation check in Algorithm 1 actually ensures that IssueTime > OrderingTime, but since LoadTime > IssueTime, it follows that LoadTime > OrderingTime.) Now, the timestamp assigned by Algorithm 2 to the load operation described earlier is max(OrderingTime, LastStoreTime) + δ. This ensures two invariants: (1) the load timestamp is strictly after the last observed store operation to the cache line, and (2) the load timestamp is strictly before the time the load operation arrives at the cache. Similarly, the timestamp assigned to a store operation ensures two invariants: (1) the store timestamp is strictly after the last observed access to the cache line, and (2) the store timestamp is strictly before the time the store operation arrives at the cache. The above invariants ensure that the global memory order preserves the actual order of load & store operations to the same cache line, thereby satisfying the read-values axiom.

5.4.3 Finite History Queues

One main drawback of the algorithm discussed earlier was that the size of the load/store history queues could not be bounded (i.e., they could grow arbitrarily large). The objective of this section is to bound the size of the history queues. Note that the timestamps could still grow arbitrarily large (we will remove this requirement in Section 5.4.4).

The history queues can be bounded by the observation that the processor cores only care about load/store request timestamps within the scope of their reorder buffer (ROB). For example, if the oldest micro-op in the reorder buffer has been dispatched at 1000 cycles, the processor core does not care about load/store history earlier than 1000 cycles. This is because memory requests from other cores carried out before this time will not cause consistency violations. Hence, history needs to be retained only for a limited interval of time called the History Retention Period (HRP). After the retention period expires, the corresponding entry can be removed from the history queues.

How long should the history retention period be? Intuitively, if the retention period is equal to the maximum lifetime of a load or store operation starting from dispatch till completion, no usable history will be lost.
However, the maximum lifetime of a memory operation cannot be bounded in a modern multicore system due to non-deterministic queueing delays in the on-chip network and memory controller. An alternative is to set HRP such that nearly all (~99%) memory operations complete within that period. If operations do not complete within HRP, then spurious violations might occur. As long as these violations only affect overall system performance and energy by a negligible amount, they can be tolerated. Increasing the value of HRP reduces spurious violations but requires large history queues (L2LHQ/L2SHQ), while decreasing the value of HRP has the reverse effect (this will be explained below). Finding the optimal value of HRP is critical to ensuring good performance and energy efficiency.

Speculation Violation Detection: Speculation violations are detected using the same algorithms in Section 5.4.2. However, since entries can be removed from the load/store history queues, checking the history queues may not yield the latest load and store timestamps. Hence, an informed assumption regarding previous history has to be made. If no load or store history is observed for a memory request, it can be safely assumed that the request did not observe any load/store operations at the L2 cache after 'CompletionTime - HRP'. Here, CompletionTime is the timestamp when the load/store request completes (i.e., the load value is returned or the store is acknowledged).

To prove the above statement for a load, consider where the LoadTime (i.e., the time when the data is read from the L2 cache) lies in relation to CompletionTime - HRP. If LoadTime < CompletionTime - HRP, then the statement is trivially true. If LoadTime > CompletionTime - HRP, then the load would only observe the results of a store operation carried out between CompletionTime - HRP and LoadTime. However, if this is the case, then the store operation timestamp would be still visible in the store history (since the retention period has not expired). A similar logic holds for the load history as well. Hence, if no history is observed, the LastStoreTime and LastAccessTime required by the algorithms in Section 5.4.2 can be adjusted as shown by the ADJUSTHISTORY function in Algorithm 3. Note that 'NONE' indicates no load/store history for a particular address.

Algorithm 3 : Finite History Queues
1: function ADJUSTHISTORY()
2:   if LastLoadTime = NONE then
3:     LastLoadTime ← CompletionTime - HRP
4:   if LastStoreTime = NONE then
5:     LastStoreTime ← CompletionTime - HRP
6:   LastAccessTime ← MAX(LastLoadTime, LastStoreTime)

Finite Queue Management: Adding entries to the history queue and searching for an address works similar to the description in Section 5.4.2. However, with a finite number of entries, two extra considerations need to be made.

1. History queue overflow needs to be considered and accommodated.

2. Queue entries need to be pruned after the history retention period (HRP) expires.

The finite history queue is managed using a set-associative structure that is indexed based on the address (just like a regular set-associative cache).

Queue Insertion: When an entry (<Address, Timestamp>) needs to be added to the queue, the set corresponding to the address is first read into a temporary register. A pruning algorithm is applied on this register to remove entries that have expired (this will be explained later). Then, the <Address, Timestamp> pair is added to the set as follows.
If the address is already present, then the maximum of the already present timestamp and the newly added timestamp is computed and written. If the address is not present, then the algorithm checks whether an empty entry is present. If yes, then the new timestamp and address are written to the empty entry. Else, the oldest timestamp in the set is retrieved and evicted. The new <Address, Timestamp> pair is written in its place. In addition to the set-associative structure, each queue also contains a Conservative Timestamp (ConsTime) field. The ConsTime field is used to hold timestamps that have overflowed till they expire. The ConsTime field is updated with the evicted timestamp.

Pruning Queue: A pruning algorithm removes entries from the queue after their retention period (HRP) has expired (and/or) resets the ConsTime field. This function uses a second counter called TC-H (shown in Figure 5-2). This counter lags behind the timestamp counter (TC) by HRP cycles. If a particular timestamp is less than TC-H, the timestamp has expired and can be removed from the queue. On every processor cycle, the TC-H value is also compared to ConsTime. If equal, the ConsTime has expired and can be reset to 'NONE'.

Searching Queue: When searching the queue for an address, the set corresponding to the address is first read into a temporary register. If there is an address match, then the timestamp is returned. Else, the ConsTime field is returned. To maintain maximum efficiency, the ConsTime field should be 'NONE' (expired) most of the time, so that spurious load/store times are not returned.

All the above functions require the entries in a set to be searched in parallel to be performance efficient. However, since these history queues are really small (< 0.8KB), the above computations can be performed within the cache access time. We observe experimentally that setting the associativity of the history queues to the same associativity as the cache keeps the ConsTime field expired for nearly all of the execution.
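The set-associative history queue just described might be modeled as follows; the container, the NONE encoding, and the exact pruning comparison are illustrative choices, and timestamps are still assumed wide enough not to roll over (Section 5.4.4 addresses rollover).

#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Sketch of one set of a finite load/store history queue with a Conservative
// Timestamp (ConsTime) for overflowed entries, as described in Section 5.4.3.
using Addr = uint64_t;
using Timestamp = uint64_t;
constexpr Timestamp NONE = 0;      // illustrative encoding for "no history / expired"

template <std::size_t Ways>
struct HistoryQueueSet {
    struct Entry { Addr addr = 0; Timestamp time = NONE; };
    std::array<Entry, Ways> entries;
    Timestamp consTime = NONE;     // holds evicted (overflowed) timestamps until expiry

    // Pruning: drop anything older than TC-H (= TC - HRP), including ConsTime.
    void prune(Timestamp tcMinusHrp) {
        for (Entry& e : entries)
            if (e.time != NONE && e.time < tcMinusHrp) e.time = NONE;
        if (consTime != NONE && consTime < tcMinusHrp) consTime = NONE;
    }

    // Insertion: keep the most recent time per address; on overflow, evict the oldest
    // entry of the set into ConsTime.
    void insert(Addr addr, Timestamp time, Timestamp tcMinusHrp) {
        prune(tcMinusHrp);
        for (Entry& e : entries)
            if (e.time != NONE && e.addr == addr) { e.time = std::max(e.time, time); return; }
        Entry* victim = &entries[0];
        for (Entry& e : entries) {
            if (e.time == NONE) { victim = &e; break; }    // empty way available
            if (e.time < victim->time) victim = &e;        // otherwise track the oldest way
        }
        if (victim->time != NONE) consTime = std::max(consTime, victim->time);
        *victim = Entry{addr, time};
    }

    // Search: return the tracked timestamp on a match, else the conservative one.
    Timestamp lookup(Addr addr) const {
        for (const Entry& e : entries)
            if (e.time != NONE && e.addr == addr) return e.time;
        return consTime;               // 'NONE' most of the time if the queue is sized well
    }
};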
5.4.4 In-Flight Transaction Timestamps

One of the main drawbacks of the algorithm presented in Section 5.4.2 is that the timestamps should be of infinite width, i.e., they are never allowed to rollover during the operation of the processor. This drawback can be removed by the observation that only the timestamps of memory operations present in the reorder buffer (ROB) need to be compared when detecting consistency violations, i.e., only memory transactions that have temporal proximity could create violations. Hence, Load and Store timestamps need to be distinct and comparable only for recent memory operations.

Finite Timestamp Width (TW): This observation can be exploited to use a finite timestamp width (TW). When the timestamp counter reaches its maximum value, it rolls over to '0' in the next cycle. The possible values that can be taken by the counter can be divided into two quantums, an 'even' and an 'odd' quantum. During the even quantum, the MSB of the timestamp counter is '0' while during the odd quantum, the MSB is '1'. For example, if the timestamp width (TW) is 3 bits, values 0-3 belong to the even quantum while 4-7 belong to the odd quantum. Now, to check which timestamp is greater than the other, it needs to be known whether the current quantum (i.e., the MSB of the TC [Timestamp Counter]) is an even or an odd quantum. If the current quantum is even, then any timestamp with an even quantum is automatically greater than a timestamp with an odd quantum and vice versa. If two timestamps of the same quantum are compared, a simple arithmetic check suffices to know which is greater.

Recency of Timestamps: A possible problem with the above comparison is when the timestamps that are compared are not recent. For example, consider a system with a timestamp width (TW) of 3 bits. Assume TC is set to 3 to start with. A timestamp, TA, is now generated and set to the value of TC, i.e., 3. Then, TC increments, reaches its maximum value of 7, rolls over to 0, and increments to 1. Now, another timestamp, TB, is set to 1. If the check TA > TB discussed above is now performed, the result is true according to the algorithm. But TA was generated before TB, so the result should have been false. The comparison check returned the wrong answer because TA was 'too old' to be useful. Timestamps have to be 'recent' in order to return an accurate answer during comparisons. Given a particular value of the timestamp counter (TC), timestamps have to be generated in the current quantum or the previous quantum to be useful for comparison. In the worst case, a timestamp should have been generated at most 2^(TW-1) cycles before the current value of TC to be useful.

Consistency Check: In the algorithms described previously, the only arithmetic check done using timestamps is at the commit point of load operations. The check performed is: IssueTime > OrderingTime. If both IssueTime and OrderingTime are recent, the check always returns a correct answer, else it might return an incorrect answer. Now, if IssueTime is recent and OrderingTime is old, the 'correct' answer for the consistency check is true; however, it might return false in certain cases. The answer being 'false' is OK, since all it triggers is a false positive, i.e., it triggers a consistency violation while in reality, there is no violation. As long as the number of false positives is kept low, the system functions efficiently. So, the important thing is to keep IssueTime recent. This is accomplished by adding another bit to each reorder buffer (ROB) entry to track the MSB of its DispatchTime (i.e., the time at which the micro-op was dispatched to the ROB). So, each ROB entry tracks if it was dispatched during the even or the odd quantum. If the DispatchTime is kept 'recent', the IssueTime of a load operation will also be recent, since issue is after dispatch.

The DispatchTime is kept recent by monitoring the entry at the head of the ROB and the timestamp counter (TC). If the TC rolls over from the odd to the even quantum with the head of the ROB pointing to an entry dispatched during the even quantum, then that entry's timestamp is considered 'old'. A speculation violation is triggered and instructions are replayed starting with the one at the head of the ROB. Likewise, if the TC rolls over from the even to the odd quantum with the head pointing to an odd quantum entry, a consistency violation is triggered.

Through experimental observations, the timestamp width (TW) is set to 16 bits. This keeps the storage overhead manageable while creating almost no false positives. With TW = 16, each entry in the ROB has 2^(TW-1) = 32768 cycles to commit before a consistency violation is triggered due to rollover.
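One way to realize the quantum-based comparison with a finite timestamp width is sketched below; TW = 16 matches the configuration chosen above, and both inputs are assumed to be 'recent' in the sense just defined (the ROB rollover check enforces this).

#include <cstdint>

// Sketch of the even/odd-quantum timestamp comparison of Section 5.4.4.
constexpr unsigned TW = 16;                     // timestamp width chosen in the text
constexpr uint32_t TS_MASK = (1u << TW) - 1u;   // timestamps live in [0, 2^TW - 1]

// Quantum = MSB of a timestamp: 0 for the 'even' quantum, 1 for the 'odd' quantum.
inline uint32_t quantum(uint32_t ts) { return (ts >> (TW - 1)) & 1u; }

// Returns true if timestamp 'a' is at least as recent as timestamp 'b', given the
// current value of the timestamp counter 'tc'.
inline bool atLeastAsRecent(uint32_t a, uint32_t b, uint32_t tc) {
    a &= TS_MASK; b &= TS_MASK; tc &= TS_MASK;
    if (quantum(a) == quantum(b))
        return a >= b;                          // same quantum: plain arithmetic check
    return quantum(a) == quantum(tc);           // otherwise the current quantum wins
}

The commit-time check IssueTime >= OrderingTime then becomes atLeastAsRecent(issueTime, orderingTime, tc), and the 3-bit example above (TA = 3, TB = 1) shows why the recency requirement is essential: without it, the same-quantum check would wrongly favor the stale TA.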
5.4.5 Mixing Remote Accesses and Private Caching

The previous sections described the implementation of TSO on a pure remote access scheme. The locality-aware protocol chooses either remote access at the shared L2 cache or private caching at the L1 cache based on the spatio-temporal locality of data. Hence, the timestamp-based consistency validation scheme should be adapted to such a protocol.

L1 Cache History Queues (L1LHQ/L1SHQ): Such an adaptation requires information about loads and stores made to the private L1-D cache to be maintained for future reference in order to perform consistency validation. This information needs to be captured because private L1-D cache loads/stores can execute out-of-order and interact with either remote or private cache accesses such that the TSO memory consistency model is violated. Similar to the history queues at the L2 cache, the L1 Load History Queue (L1LHQ) and the L1 Store History Queue (L1SHQ) are added at the L1-D cache (shown in Figure 5-2) and capture the load and store history respectively. The history retention period (HRP) dictates how long the history is retained for. The management of the L1LHQ/L1SHQ (i.e., adding/pruning/searching) is carried out in the exact same manner as the L2 history queues.

With history queues at multiple levels of the cache hierarchy, it is important to keep them synchronized. An invalidation, downgrade, or eviction request at the L1-D cache causes the last load/store timestamps (if found) to be sent back along with the acknowledgement so that they can be preserved at the shared L2 cache history queues until the history retention period (HRP) expires. From the L2LHQ/L2SHQ, these load/store timestamps could be passed onto cores that remotely access or privately cache data. A cache line fetch from the shared L2 cache into the L1-D cache copies the load/store history for the address into the L1LHQ/L1SHQ as well. This enables the detection of consistency violations using the same timestamp-based validation technique described earlier. Entries are added to and retrieved from the L1 history queues using the same mechanism as described for the L2 history queues. Pruning works in the same way as well. If the L1 history queues overflow, the mechanism described earlier is used to retrieve a conservative timestamp.

Exclusive Store Prefetch: Since exclusive store prefetch requests can be used to improve the performance of stores that are cached at the L1-D cache, they must be leveraged by the locality-aware protocol also. In fact, these prefetch requests can be leveraged by remote stores to prefetch cache lines from off-chip DRAM into the L2 cache. This can be accomplished only if both private and remote stores are executed in two phases. The 1st phase (exclusive store prefetch) is executed in parallel and potentially out of program order as soon as the store address is ready. If the cache line is already present at the L2 cache, then the 1st phase for remote stores is effectively a NOP, but must be executed nevertheless since the information about whether a store is handled remotely or cached privately is only present at the directory (that is co-located with the shared L2 cache). The 2nd phase (actual store) is executed in order, i.e., the store is issued only after all previous stores have completed (to ensure TSO ordering) and the 1st phase of the current store has been acknowledged. The 2nd phase for stores to the private L1-D cache completes quickly (~1-2 cycles), while remote stores have to execute a round-trip network traversal to the remote L2 cache before being completed.
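The two-phase store flow could be modeled as below; the request-sending helpers are placeholders for the actual cache/network interface, and the in-order condition shown here is the strict one of this subsection (the next subsection relaxes it for remote stores).

#include <cstdint>

// Sketch of the two-phase store execution of Section 5.4.5 (illustrative only).
struct StoreEntry {
    uint64_t addr = 0;
    bool addressReady = false;   // address computed and translated
    bool phase1Acked = false;    // exclusive prefetch / pending-store notification acked
    bool committed = false;      // retired in program order from the ROB
};

// Placeholders standing in for the on-chip network and cache hierarchy.
void sendExclusivePrefetch(uint64_t /*addr*/) {}
void sendActualStore(uint64_t /*addr*/) {}

// Phase 1: may be issued out of program order as soon as the address is known.
// For a privately cached line it is an exclusive prefetch into the L1-D cache; for a
// remote line it (at most) prefetches into the L2 and announces the pending store.
bool tryIssuePhase1(const StoreEntry& st) {
    if (!st.addressReady) return false;
    sendExclusivePrefetch(st.addr);
    return true;
}

// Phase 2: issued strictly in order, only after the store has committed, all
// previous stores have completed, and its own phase 1 has been acknowledged.
bool tryIssuePhase2(const StoreEntry& st, bool allPreviousStoresCompleted) {
    if (!st.committed || !st.phase1Acked || !allPreviousStoresCompleted) return false;
    sendActualStore(st.addr);
    return true;
}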
5.4.6 Parallel Remote Stores

This section deals with improving the performance of remote store requests. Serializing remote store requests could impact the processor performance in two ways:

1. Increase the store queue stalls due to capacity limitations.

2. Increase the reorder buffer (ROB) stalls due to store queue drains when fences are present.

If an application has a significant stream of remote stores, the store queue gets filled up quickly and stalls the pipeline. The pipeline can now execute only at the throughput of remote store operations. To match the maximum throughput of a processor pipeline with a fetch width of 2 instructions, store operations have to execute at a throughput of one every 4.5 cycles (assuming 1 memory reference every 3 instructions and 1 store every 3 memory accesses). This is not possible since a round-trip network request takes ~24 cycles (assuming an 8x8 mesh and a 2 cycle per-hop latency) without factoring in the cache access cost or network contention. Moreover, if the application contains fences, the fence commit has to stall till the store buffer has been drained, which takes a higher amount of time if remote store operations are present. This increases the occupancy of the fence in the ROB which increases the probability of ROB stalls.

In this section, a mechanism is developed that allows remote store operations to be issued in-order, executed in parallel and completed out-of-order with other memory operations. The mechanism is based on the idea that a processor that issues store operations to memory in program order is indistinguishable from one that executes store operations in parallel, as long as the stores are observed by a thread running on a different core in program order. This is accomplished by leveraging the two-phase mechanism used when executing remote store operations.

The 1st phase, in addition to performing an exclusive prefetch, informs the L2 cache that a remote store is pending. The remote cache captures this information in an additional storage queue called the Pending Store History Queue (PSHQ) (shown in Figure 5-2). The PSHQ captures the time at which the pending store history arrived at the remote L2 cache. (Note that for a private L1-D cache store, the PSHQ is not modified.) The purpose of the PSHQ is to inform future readers that a remote store is about to happen in the near future. This information is used in the speculation validation process as will be described later. The 2nd phase, when performing the actual store operation (i.e., writing the updated value to the remote cache), removes the corresponding entry from the PSHQ.

The 2nd phase of a private/remote store can be issued only after all previous micro-ops have committed (to ensure precise interrupts), all previous stores to the private L1-D cache have completed and the 1st phases of all previous remote stores have been acknowledged. Waiting for all previous private L1-D cache stores to be completed implies that the StoreA → StoreB ordering is trivially preserved if StoreA is intended for the private L1-D cache.

When an L2 cache load request is made, the cache controller searches the PSHQ in addition to the L2LHQ/L2SHQ and returns a LastPendingStoreTime (if found). In case multiple entries are found, the minimum timestamp (i.e., the oldest entry) is returned to the requester. The COMMITLOAD function in Algorithm 4 is used to validate (at the commit time of a load) that the Load→Load and Store→Store ordering requirements have been met. At the commit time of a load, the minimum of the LastPendingStoreTime and the IssueTime of the load is compared to OrderingTime. If greater, it has two implications: (1) the load has been issued after all stores observed by preceding loads have been written to the shared L2 cache, and (2) any pending store occurs after all the stores observed by the current thread, in program order. However, if the LastPendingStoreTime is less than the OrderingTime, it is possible that the pending store occurs before a store observed by the current thread in program order and hence, memory consistency could be violated.
If greater, it has two implications: (1) the load has been issued after all stores observed by preceding loads have been written to the shared L2 cache, and (2) any pending store occurs after all the stores observed by the current thread, in program order. However, if the LastPendingStoreTime is lesser than the OrderingTime, it is possible that the pending store occurs before a store observed by the current thread 100 in program order and hence, memory consistency could be violated. The pending store history serves as an instrument for enabling future stores to be executed before the 2nd phase of the current store has been completed. Algorithm 4 : Parallel Remote Stores 1: function COMMITLOAD() 2: if MIN(IssueTime, LastPendingStoreTime) < OrderingTime then REPLAYINSTRUCTIONsFROMCURRLOAD() 3: 4: 5: 6: return if OrderingTime < LastStoreTime then OrderingTime = LastStoreTime Queue Overflow: If the PSHQ is full, then an entry is not added during the 1st phase. The requesting core is notified of this fact during the acknowledgement. In this scenario, the core delays executing the 2nd phase of all future stores until the 2nd phase of the current remote store has been completed. This trivially ensures that the current remote store is placed before later stores (both private/remote) in the global memory order. Speculation Failure: If speculation fails, then the PSHQ entries corresponding to the remote stores on the wrong path have to be removed. This is accomplished using a separate network message that removes the PSHQ entries before instructions are replayed. 5.4.7 Overheads The overheads of the timestamp-based technique that uses all the mechanisms described previously are stated below. Li Cache: The L1SHQ (L1 store history queue) is sized based on the expected throughput of store requests to the private L1-D cache and the History Retention Period (HRP). In Section 5.8.3, HRP is fixed to be 512 ns. A memory access is expected every 3 instructions, and a store is expected every 3 memory accesses, so for a single issue processor with a 1 GHz clock, a store is expected every 9 ns. Each SHQ entry contains the store timestamp and the physical cache line address (42 bits). The width of each timestamp is 16 bits (as discussed above). Hence, the size of the 101 L1SHQ = 2x+ bits = 0.4KB. The throughput of loads is approximately twice that of stores, hence the size of the L1LHQ (LI load history queue) is 0.8KB. Since the L1LHQ and L1SHQ are much smaller than the L1-D cache, they can be accessed in parallel to the cache tags, and so, do not add any extra latency. The energy expended when accessing these structures is modeled in our evaluation. L2 Cache: The L2SHQ is sized based on the expected throughput of remote store requests to the L2 cache and invalidations /write-backs from the L1-D cache. The throughput of remote requests is much less than that of private L1-D cache requests, but can be susceptible to higher contention if many remote requests are destined for the same L2 cache slice. To be conservative, the expected throughput is set to one store every 18 processor cycles (this is 4x the average expected throughput from experiments). The same calculation (listed above) is repeated to obtain a size of 0.2KB. The L2LHQ has twice the expected throughput as the L2SHQ, so its size is 0.4KB. The PSHQ (Pending Store History Queue) is sized based on the expected throughput of remote stores as well. Each pending store is maintained till its corresponding 2nd phase store arrives. 
The PSHQ (Pending Store History Queue) is sized based on the expected throughput of remote stores as well. Each pending store is maintained till its corresponding 2nd phase store arrives. This latency is conservatively assumed to be 512 ns. Also, each entry in the PSHQ contains a requester core ID in addition to the address and timestamp. Hence, the size of the PSHQ = (512/18) x (16 + 42 + 6) bits ≈ 0.2 KB. Since the L2LHQ and L2SHQ are much smaller than the L2 cache, they can be accessed in parallel to the cache tags, and so do not add any extra latency. The energy expended when accessing these structures is modeled in our evaluation.

Load/Store Queues & Reorder Buffer: Each load queue and store queue entry is augmented with 3 and 1 timestamps respectively (as shown in Figures 5-3 & 5-4). With 64 load queue entries, the overhead is 64 x 3 x 16 bits = 384 bytes. With 48 store queue entries, the overhead is 48 x 16 bits = 96 bytes. A single bit added to the ROB for timestamp overflow detection only has a negligible overhead.

Network Traffic: Whenever an entry is found in the L1LHQ/L1SHQ on an invalidation/write-back request, or an entry is found in the L2LHQ/L2SHQ during a remote access or cache line fetch from the L2 cache, the corresponding timestamp is added to the acknowledgement message. Since each timestamp width is 16 bits and the network flit size is 64 bits (see Table 2.1), even if all three of the load, store & pending-store timestamps need to be accommodated, only 1 extra flit needs to be added to the acknowledgement.

The total storage overhead is ~2.5 KB counting all the above changes.

5.4.8 Forward Progress & Starvation Freedom Guarantees

Forward progress for each core is guaranteed by the timestamp-based consistency validation protocol. To understand why, consider the two reasons why load speculation could fail: (1) Consistency Check Violation, and (2) Timestamp Rollover. If speculation fails due to the consistency check violation, then re-executing the load is guaranteed to allow it to complete. This is because the IssueTime of the load (when executed for the second time) will always be greater than the time at which the consistency check was made, i.e., the commit time of the load, which in turn is greater than OrderingTime. This is because OrderingTime is simply the maximum of load and store timestamps observed by previous memory operations, and the time at which the load is committed is trivially greater than this. If speculation fails due to timestamp rollover, then re-executing the loads/stores is guaranteed to succeed because it cannot conflict with any previous operation. Since forward progress is guaranteed for all the cores in the system, this technique of ensuring TSO ordering is starvation-free.

5.5 Discussion

5.5.1 Other Memory Models

Section 5.4 discussed how to implement the TSO memory ordering on the locality-aware coherence protocol. The TSO model is the most popular, being employed by x86 and SPARC processors. Other memory models of interest are Sequential Consistency (SC), Partial Store Order (PSO), and the IBM Power/ARM models.
Partial Store Order (PSO) relaxes the Store → Store ordering and only enforces it when a fence is present. This enables all stores, both private & remote, to be issued in parallel and potentially completed out-of-order. On a fence that enforces Store → Store ordering, stores after the fence can be issued only after the stores before it complete.

IBM Power is a more relaxed model that enforces minimal ordering between memory operations in the absence of fences. Here, we discuss how its two main fences, lwsync and hwsync, are implemented. The lwsync fence enforces TSO ordering and can be implemented by maintaining a LoadOrderingTime field that keeps track of the maximum LastStoreTime observed so far. On a fence, the LoadOrderingTime is copied to the OrderingTime field and the timestamp checks outlined earlier are run. The hwsync fence enforces SC ordering. This can be implemented by taking the maximum of the LoadOrderingTime and StoreOrderingTime and updating the OrderingTime field with this maximum. The ARM memory model is similar to the IBM Power model and hence can be implemented in a similar way.

5.5.2 Multiple Clock Domains
The assumption in Section 5.4 was that there was only a single clock domain in the system. However, current multicore processors are gravitating towards multiple clock domains with independent dynamic voltage and frequency scaling (DVFS). All current processors have a global clock generation circuit (PLL) (to the best of our knowledge). The global clock is distributed using digital clock divider circuits to enable multiple cores to run at different frequencies. However, clock boundaries are synchronous and predictable. This is done because asynchronous boundaries within a chip have reliability concerns and PLLs are power-hungry circuits. If clock boundaries are synchronous, per-core timestamps that are incremented on each core's local clock can be translated into 'global' timestamps using a multiplier that is inversely proportional to each core's frequency. The per-core timestamp counters, however, cannot be clock-gated since they must always be translatable into a global timestamp using a multiplier. If a core changes its frequency, the timestamp counter must be re-normalized to the new frequency and the multiplier changed as well. Lastly, in the event that multiple clock generation circuits are used, it is possible to bound the clock skew between cores and incorporate this skew into our load speculation validator (basically, each store timestamp would be incremented with this skew before comparison).

5.6 Parallelizing Non-Conflicting Accesses
An alternate/complementary method to exploit memory level parallelism (MLP) while maintaining TSO is to recognize the fact that only conflicting accesses to shared read-write data can cause memory consistency violations. Concurrent reads to shared read-only data and accesses to private data cannot lead to violations [84]. Such memory accesses can be both issued and completed out-of-order. Only memory accesses to shared read-write data must be ordered as described in Section 5.4.1.

5.6.1 Classification
In order to accomplish the required classification of data into private, shared read-only and shared read-write, a page-level classifier is built by augmenting existing TLB and page table structures. Each page table entry is augmented with the attributes shown in Figure 5-5.

Figure 5-5: Structure of a page table entry.

• Private: Denotes if the page is private ('1') or shared ('0').
• Read-Only: Denotes if the page is read-only ('1') or read-write ('0').
• Last-Requester-ID: Tracks the last requester that accessed the particular page. Aids in classifying the page as private or shared.
Let us describe how memory accesses are handled under this implementation.

Load Requests: On a load request, the TLB is looked up to obtain the attributes of a page. If the memory request misses the TLB, the TLB miss handler is invoked. The operation of this handler is described in Section 5.6.2. Once the TLB miss handler populates the TLB with the page table entry of the requested address, the attributes are looked up again. If the page is private or read-only, then the load request can be immediately issued to the cache subsystem. But if the page is shared and read-write, the load request waits in the load queue till all previous loads complete.

Store Requests: On a store request, the TLB is looked up to obtain the attributes. If it is a TLB miss, the miss handler is invoked again. Once the TLB is populated, the attributes are looked up. If the page is read-only, the Page Protection Fault Handler is invoked. The operation of this handler is described in Section 5.6.3. When the handler returns, the read-only attribute of the page should be marked as '0'. Now, if the page is private, the store request can be issued as long as previous operations have all committed (this is to maintain single-thread correctness). On the other hand, if the page is shared, the store request waits in the store queue till all previous stores have completed.

5.6.2 TLB Miss Handler
On a TLB miss, the handler checks if the page is being accessed for the first time. If yes, the Private bit is set to '1' and the last-requester-ID is set to the ID of the currently accessing core. The read-only bit is also set to '1' since this is the very first access to the page. On the other hand, if the page was accessed earlier, the status of the private bit is checked. If the private bit is set to '1', the last-requester-ID is checked against the ID of the currently accessing core. If it matches, the page table entry is simply loaded into the TLB. Else, the entry in the last-requester-ID's TLB is invalidated. The invalidation request only returns after all outstanding memory operations to the same page made by the last-requester-ID have completed and committed. This is required to ensure that the TSO model is not violated because these outstanding requests are now directed at shared data. In addition, the private bit is set to '0' since this page is now shared amongst multiple cores. On the other hand, if the private bit was set to '0' to start with, then the page table entry is simply loaded into the TLB.

5.6.3 Page Protection Fault Handler
A page-protection fault is triggered when a write is made to a page whose read-only attribute is set to '1'. The fault handler first checks if the page is private. If so, the read-only attribute is simply set to '0'. If the page is shared, then the TLB entries of all the cores in the system are invalidated. Finally, the read-only attribute is set to '0' and the handler returns.

5.6.4 Discussion
The advantage of the above scheme is that it requires negligible additional hardware capabilities. However, operating system support is required. Moreover, the performance is highly dependent on the number of private and read-only pages in the application.
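For concreteness, the sketch below restates the handlers of Sections 5.6.2 and 5.6.3 as software-style logic over a simplified page table entry. The structure layout, function names and stubbed TLB-invalidation hooks are assumptions made purely for illustration.

// Sketch of the page-level classifier logic (Sections 5.6.2 and 5.6.3).
// The entry layout and the stubbed TLB hooks are illustrative assumptions.
#include <cstdint>
#include <cstdio>

// Assumed platform hooks, stubbed so the sketch compiles standalone.
static void InvalidateTLBEntry(uint16_t core) {
    std::printf("invalidate TLB entry on core %u\n", static_cast<unsigned>(core));
}
static void InvalidateAllTLBEntries() {
    std::printf("invalidate TLB entry on all cores\n");
}

struct PageTableEntry {
    bool     private_page    = false;
    bool     read_only       = false;
    uint16_t last_requester  = 0;
    bool     accessed_before = false;
};

// Invoked on a TLB miss by core 'core_id' (Section 5.6.2).
void OnTLBMiss(PageTableEntry& pte, uint16_t core_id) {
    if (!pte.accessed_before) {           // very first access to the page
        pte.accessed_before = true;
        pte.private_page    = true;
        pte.read_only       = true;
        pte.last_requester  = core_id;
    } else if (pte.private_page && pte.last_requester != core_id) {
        // Another core owned the page: invalidate its TLB entry (which waits
        // for its outstanding accesses to the page to commit), then mark shared.
        InvalidateTLBEntry(pte.last_requester);
        pte.private_page = false;
    }
    // In all cases the entry is then loaded into the requesting core's TLB.
}

// Invoked on a write to a page whose read-only attribute is '1' (Section 5.6.3).
void OnPageProtectionFault(PageTableEntry& pte) {
    if (!pte.private_page)
        InvalidateAllTLBEntries();        // shared page: flush every core's TLB entry
    pte.read_only = false;
}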
5.6.5 Combining with Timestamp-based Speculation
The TSO ordering of shared read-write data can be implemented in a straightforward manner following the steps in Section 5.4.1. A higher performing strategy would be to execute the accesses to shared read-write data out of program order and employ the timestamp-based speculative execution scheme discussed in the previous sections for ensuring TSO ordering. Note that using the timestamp check only for shared read-write data implies that the history queue modifications and search operations can be avoided for private & shared read-only data. This reduces the energy overhead of the history queues. We will evaluate these two approaches in this chapter.

5.7 Evaluation Methodology
We evaluate a 64-core shared memory multicore using out-of-order cores. The default architectural parameters used for evaluation are shown in Table 2.1. The parameters specific to the timestamp-based speculation violation detection and the locality-aware private cache replication scheme are shown in Table 5.1.

Architectural Parameter                              Value
Out-of-Order Core
  Speculation Violation Detection                    Timestamp-based
  Timestamp Width (TW)                               16 bits
  History Retention Period (HRP)                     512 ns
  L1 Load/Store History Queue (L1LHQ/L1SHQ) Size     0.8 KB, 0.4 KB
  L2 Load/Store History Queue (L2LHQ/L2SHQ) Size     0.4 KB, 0.2 KB
  Pending Store History Queue (PSHQ) Size            0.2 KB
Locality-Aware Private Cache Replication
  Private Caching Threshold                          PCT = 4
  Max Remote Access Threshold                        RATmax = 16
  Number of RAT Levels                               nRATlevels = 2
  Classifier                                         Limited_3
Table 5.1: Timestamp-based speculation violation detection & locality-aware private cache replication scheme parameters.

5.7.1 Performance Models
Word & Cache Line Access: The Locality-aware Private Cache Replication protocol requires two separate access widths for reading or writing the shared L2 caches, i.e., a word for remote sharers and a cache line for private sharers. For simplicity, we assume the same L2 cache access latency for both word and cache line accesses.

Load Speculation: The overhead of speculation violation is modeled by fast-forwarding the fetch time of the offending instruction to its commit time (on the previous attempt) when a violation is detected. This models the first-order performance drawback of speculation. However, the performance drop due to the network traffic and cache accesses incurred by in-flight instructions is not modeled.

5.7.2 Energy Models
Word & Cache Line Access: When calculating the energy consumption of the L2 cache, we assume a word-addressable cache architecture. This allows our protocol to have a more efficient word access compared to a cache line access. We model the dynamic energy consumption of both the word access and the cache line access in the L2 cache.

Load Speculation: The energy overhead due to speculation violations is obtained using the following analytical model.

Energy_Speculation = Speculation_Stall_Cycles x IPC x Energy_per_Instruction

The first two terms together give the total number of in-flight instructions when the speculation violation was detected. If this is multiplied by the average energy per instruction, the energy overhead due to speculation can be obtained.

History Queues: The energy consumption for the INSERT and SEARCH operations of each history queue is conservatively assumed to be the amount of energy it takes for an L1-D cache tag read. Note that the size of the L1-D cache tag array is 2.6 KB. The tag array contains 512 tags, each 36 bits wide (subtracting out the index and offset bits from the physical address). On the other hand, the size of each history queue is < 0.8 KB.
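As a small illustration of the analytical model above, the sketch below evaluates it for made-up inputs; the values passed in are placeholders, not measurements from the evaluation.

// Sketch of the analytical speculation-energy model stated above.
// The example inputs are placeholders, not measured values.
#include <cstdio>

double SpeculationEnergy(double stall_cycles, double ipc, double energy_per_insn_nj) {
    // stall_cycles * ipc approximates the instructions in flight when the
    // violation was detected; each is charged one average instruction energy.
    return stall_cycles * ipc * energy_per_insn_nj;
}

int main() {
    double nj = SpeculationEnergy(/*stall_cycles=*/120.0, /*ipc=*/1.5,
                                  /*energy_per_insn_nj=*/0.2);
    std::printf("estimated speculation energy overhead: %.1f nJ\n", nj);
    return 0;
}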
5.8 Results
5.8.1 Comparison of Schemes
In this section, we perform an exhaustive comparison between the various schemes introduced in this chapter to implement locality-aware coherence on an out-of-order processor while maintaining the TSO memory model. The comparison is performed against the Reactive-NUCA protocol. All implementations of the locality-aware protocol use a PCT value of 4. Section 5.8.2 describes the rationale behind this choice.

Figure 5-6: Completion Time breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here.

1. Reactive-NUCA (RNUCA): This is the baseline scheme that implements the data placement and migration techniques of R-NUCA (basically, the locality-aware protocol with a PCT of 1).
2. Simple TSO Implementation (SER): The simplest implementation of TSO on the locality-aware protocol that serializes memory accesses naively according to TSO ordering (cf. Section 5.4.1).
3. Parallel Non-Conflicting Accesses (NC): This scheme classifies data as shared/private and read-only/read-write at page granularity and only applies serialization to shared read-write data (cf. Section 5.6).
4. Timestamp-based Consistency Validation (TS): Executes loads speculatively using timestamp-based validation (cf. Section 5.4) for shared read-write data. Shared read-only and private data are handled as in the NC scheme.
5. Timestamp-based + Stall Fence (TS-STF): Same as TS but the micro-op dispatch is stalled till a fence completes to ensure Fence → Load ordering. The L1LHQ & L2LHQ are not required for detecting violations here. It has lower hardware overhead than TS but potentially lower performance due to stalling on fence operations.
6. No Speculation Violations (IDEAL): Same as TS but speculation violations are ignored. It provides the upper limit on performance and energy consumption. The L1 & L2 history queues are not required since no speculation failure checks are made.

Figure 5-7: Energy breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here.

The completion time and energy consumption of the above schemes are plotted in Figures 5-6 & 5-7 respectively.

Completion Time: The parallel completion time is broken down into the following 8 categories:
1. Instructions: Number of instructions in the application.
2. L1-I Fetch Stalls: Stall time due to instruction cache misses.
3. Compute Stalls: Stall time due to waiting for functional unit (ALU, FPU, Multiplier, etc.) results.
4. Memory Stalls: Stall time due to load/store queue capacity limits, fences and waiting for loads.
5. Load Speculation: Stall time due to memory consistency violations caused by speculative loads.
6. Branch Speculation: Stall time due to mis-predicted branch instructions.
7. Synchronization: Stall time due to waiting on locks, barriers and condition variables.
8. Idle: Initial time spent waiting for a thread to be spawned.

Benchmarks with a high private cache miss rate such as BARNES, OCEAN-NC, CONCOMP, and DEDUP do not perform well with the SER scheme. This is because the cache misses cause the load and store queues to fill up, thereby stalling the pipeline. This can be understood by observing the fraction of memory stalls in the completion time of these benchmarks (with the SER scheme). In addition, these benchmarks contain a significant degree of synchronization as well. This causes the memory stalls in one thread to increase the synchronization penalty of threads waiting on it, thereby creating a massive slowdown.

The performance problems observed by the SER scheme are shared by the NC scheme as well, particularly in the BARNES, OCEAN-NC and CANNEAL benchmarks. The NC scheme can only efficiently handle accesses to private and shared read-only data. Since these benchmarks contain a majority of accesses to shared read-write data, the NC scheme performs poorly. The synchronization penalties lead to a massive slowdown for the NC scheme as well. Benchmarks such as DEDUP, PATRICIA, CONCOMP and MATMUL contain a significant number of non-conflicting memory accesses (i.e., accesses to private and shared read-only data), and hence, the NC scheme is able to perform better than the SER scheme. The improved performance is a result of the lower percentage of memory stalls (and related synchronization waits).

The TS scheme performs well on all benchmarks and matches the performance of the IDEAL scheme. The TS scheme performs better than Reactive-NUCA on all benchmarks except LU-C, where it performs worse due to store queue stalls obtained from serializing remote stores. Overall, the TS scheme only spends a small amount of time stalling due to load speculation violations. The stall time is only visible in one benchmark, OCEAN-NC, and even here, stalling due to speculation violation just replaces the already occurring stalls due to the limited size of the reorder buffer.

The TS-STF scheme stalls the dispatch stage on a fence till all previous stores have committed. Hence, it performs poorly in benchmarks with a significant number of fences, e.g., CANNEAL and TSP, while it performs well when the number of fences is negligible, e.g., FACESIM and DEDUP. Note that all fences seen in the evaluated benchmarks are implicit fences introduced by atomic operations in x86, e.g., test-and-set, compare-and-swap, etc. There were almost no explicit MFENCE instructions observed. Overall, the TS scheme improves performance by 16% compared to RNUCA, and the TS-STF scheme improves performance by 10%. The SER and NC schemes reduce performance by 14% and 11% respectively.

Energy: All the locality-aware coherence protocol implementations (i.e., all except RNUCA) are found to significantly reduce L2 cache and network energy due to the following 3 factors:
1. Fetching an entire line on a cache miss is replaced by multiple cheaper word accesses to the shared L2 cache.
2. Reducing the number of private sharers decreases the number of invalidations (and acknowledgments) required to keep all cached copies of a line coherent.
Synchronous write-back requests that are needed to fetch the most recent copy of a line are reduced as well. (Note: increasing the value of PCT reduces the number of private sharers and increases the number of remote sharers.)
3. Since the caching of low-locality data is eliminated, the L1 cache space is more effectively used for high-locality data, thereby decreasing the amount of asynchronous evictions (that lead to capacity misses) for such data.

Among the locality-aware coherence protocol implementations, SER, NC, and IDEAL exhibit the best dynamic energy consumption. Dynamic energy consumption increases when TS-STF is used and increases even further when TS is used. This is because both these implementations modify and access the L1 & L2 cache history queues. The TS-STF scheme only requires the store history queues since it stalls on a fence, while the TS scheme requires both load & store history queues to perform consistency checks, thereby creating a larger energy overhead. Note that page-classification is used in both the TS & TS-STF schemes to ensure that history queue modification & access is only done for shared read-write data since accesses to private & shared read-only data cannot cause consistency violations. Overall, the TS, TS-STF, SER & NC schemes reduce energy by 21%, 23%, 25% and 25% respectively over the RNUCA baseline.

5.8.2 Sensitivity to PCT
In this section, we study the impact of the Private Caching Threshold (PCT) parameter on the overall system performance. PCT controls the percentage of remote and private cache accesses in the system. A higher PCT increases the percentage of remote accesses while a lower PCT increases the percentage of cache line fetches into the private cache. Finding the optimal value of PCT is paramount to system performance. We plot the geometric means of the Completion Time and Energy for our benchmarks as a function of PCT in Figure 5-8.

Figure 5-8: Completion Time and Energy consumption as PCT varies from 1 to 16. Results are normalized to a PCT of 1 (i.e., Reactive-NUCA protocol).

We observe a gradual decrease in completion time till a PCT of 4, constant completion time till a PCT of 8 and then a gradual increase afterward. Energy consumption reduces steadily till a PCT of 4, reaches a global minimum at 4 and then increases steadily afterward.

Varying PCT impacts energy consumption by changing both network traffic and cache accesses. Fetching an entire line on a cache miss requires moving 10 flits over the network. On the other hand, fetching a word only costs 3 flits while writing a word to a remote cache costs 5 flits (due to two-phase stores). As PCT increases, cache line fetches are traded off with increasing numbers of remote-word accesses. This causes the network traffic to first reduce and then increase as PCT increases, thus explaining the trends in energy consumption. The completion time shows an initial gradual reduction due to lower network contention. The gradual increase afterward is due to increased network traffic at higher values of PCT. Overall, the locality-aware protocol obtains a 16% completion time reduction and a 21% energy reduction when compared to the Reactive-NUCA baseline.
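The flit costs above can be turned into a rough traffic model that reproduces this decrease-then-increase shape. The sketch below is a deliberately simplified estimate (it ignores the classifier's training period, invalidation traffic and store acknowledgements), and the reuse histogram it uses is a made-up placeholder, not measured data.

// Rough model of network traffic as PCT varies, using the flit costs stated
// above (10 flits per line fetch, 3 per remote word read, 5 per remote word
// write). The reuse histogram below is a synthetic placeholder.
#include <cstdio>

double EstimateTotalFlits(int pct, const int lines_with_reuse[], int max_reuse,
                          double write_fraction) {
    double flits = 0;
    for (int reuse = 1; reuse <= max_reuse; reuse++) {
        int lines = lines_with_reuse[reuse];
        if (lines == 0) continue;
        if (reuse < pct)   // low-reuse lines: every access is a remote word access
            flits += lines * reuse * (3 * (1 - write_fraction) + 5 * write_fraction);
        else               // high-reuse lines: one cache line fetch, then L1 hits
            flits += lines * 10;
    }
    return flits;
}

int main() {
    int hist[17] = {0};
    hist[1] = 50; hist[2] = 20; hist[8] = 20; hist[16] = 10;  // synthetic mix
    const int pcts[] = {1, 4, 8, 16};
    for (int pct : pcts)
        std::printf("PCT=%2d -> ~%.0f flits\n",
                    pct, EstimateTotalFlits(pct, hist, 16, 0.2));
    return 0;
}

With this synthetic mix, traffic falls from PCT = 1 to PCT = 4, stays flat to PCT = 8, and rises again at PCT = 16, mirroring the qualitative trend described above.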
5.8.3 Sensitivity to History Retention Period (HRP)
In this section, the impact of the History Retention Period (HRP) on system performance is studied. Figure 5-9 plots the completion time as a function of HRP.

Figure 5-9: Completion Time sensitivity to History Retention Period (HRP) as HRP varies from 64 to 4096.

A small value of HRP reduces the size requirement of the load/store history queues at the L1 and L2 caches (L1LHQ, L1SHQ, L2LHQ & L2SHQ) as described in Section 6.2.3. A small HRP also reduces network traffic since timestamps are less likely to be found in the history queues, and thus less likely to be communicated in a network message. However, a small HRP also discards history information faster, requiring the mechanism to make a conservative assumption regarding the time the last loads/stores were made (cf. Section 5.4.3). This increases the chances of the speculation check failing, thereby increasing completion time.

From Figure 5-9, we observe that an HRP of 64 performs considerably worse when compared to the other data points. The high completion time is due to the instruction replays incurred due to speculation violation. HRP values of 128 and 256 reduce the speculation violations, and thereby improve performance considerably. However, we opted to go for an HRP of 512 since its performance is within ~1% of an HRP of 4096.

5.9 Summary
This chapter studied the efficiency and programmability tradeoffs with a state-of-the-art data access mechanism called remote access. Complex cores and strong memory models impose memory ordering restrictions that have been efficiently managed in traditional coherence protocols. Remote access introduces serialization penalties, which hampers the memory-level parallelism (MLP) in an application. A timestamp-based speculation scheme is proposed that enables remote accesses to be issued and completed in parallel while continuously detecting whether any ordering violations have occurred and rolling back the pipeline state (if needed). The scheme is implemented for a state-of-the-art locality-aware cache coherence protocol that uses remote access as an auxiliary mechanism for efficient data access. The evaluation using a 64-core multicore with out-of-order speculative cores shows that our proposed technique improves completion time by 16% and energy by 21% over a state-of-the-art cache management scheme while requiring only 2.5 KB storage overhead per core.

Chapter 6
Locality-aware LLC Replication Scheme

This thesis proposes a data replication mechanism for the LLC that retains the on-chip cache utilization of the shared LLC while intelligently replicating cache lines close to the requesting cores so as to maximize data locality. To achieve this goal, a low-overhead yet highly accurate in-hardware locality classifier is proposed that operates at the cache line granularity and only allows the replication of cache lines with high reuse. This classifier captures the LLC pressure and adapts its replication decision accordingly.

This chapter is organized as follows. Section 6.1 motivates data replication in the LLC. Section 6.2 describes a detailed implementation of locality-aware replication. Section 6.3 discusses the rationale behind key design decisions. Section 6.4 designs a mechanism that performs replication at the cluster level and discusses the efficacy of this approach.
Section 6.5 describes the evaluation methodology. Section 6.6 presents the results. Finally, Section 6.7 provides a brief summary of this chapter.

Figure 6-1: Distribution of instructions, private data, shared read-only data, and shared read-write data accesses to the LLC as a function of run-length. The classification is done at the cache line granularity.

6.1 Motivation
6.1.1 Cache Line Reuse
The utility of data replication at the LLC can be understood by measuring cache line reuse. Figure 6-1 plots the distribution of the number of accesses to cache lines in the LLC as a function of run-length. Run-length is defined as the number of accesses to a cache line (at the LLC) from a particular core before a conflicting access by another core or before it is evicted. Cache line accesses from multiple cores are conflicting if at least one of them is a write. The L2 cache accesses are broken down into the following four categories: (1) Instruction, (2) Private Data, (3) Shared Read-Only (RO) Data, and (4) Shared Read-Write (RW) Data.

For example, in BARNES, over 90% of the accesses to the LLC occur to shared (read-write) data that has a run-length of 10 or more. The greater the number of accesses with a higher run-length, the greater the benefit of replicating the cache line in the requester's LLC slice. Hence, BARNES would benefit from replicating shared (read-write) data. Similarly, FACESIM would benefit from replicating instructions and PATRICIA would benefit from replicating shared (read-only) data. On the other hand, FLUIDANIMATE and OCEAN-C would not benefit since most cache lines experience just 1 or 2 accesses to them before a conflicting access or an eviction. For such benchmarks, replication would pollute the LLC and increase memory access latency and invalidation penalty without improving data locality. Hence, the replication decision should not depend on the type of data, but rather on its locality. Instructions and shared data (both read-only and read-write) can be replicated if they demonstrate good reuse. It is also important to adapt the replication decision at runtime in case the reuse of data changes during an application's execution.

6.1.2 Cluster-level Replication
Another design option to implement replication at the LLC (/L2 cache) is to perform replication at the cluster level instead of creating a replica per core, i.e., group a set of LLC slices into a cluster and create at most one replica per cluster. In order to gain more insight into this approach, the accesses to the L2 cache are studied as a function of the sharing degree. The sharing degree of a cache line is defined as the number of sharers when it is accessed. Figures 6-2 and 6-3 plot this information for four applications. The L2 cache accesses are broken down into the following four categories: (1) Instruction, (2) Shared Read-Only (RO) Data, (3) Shared Read-Write (RW) Data Read and (4) Shared Read-Write (RW) Data Write. Private Data Read and Private Data Write are special cases of Shared Read-Write Data read and write with the number of sharers equal to 1.
The majority of L2 cache accesses in BARNES and BODYTRACK are read accesses to widely shared data. In BARNES, all the accesses are to shared read-write data (this type of data is frequently read and sparsely written). However, in BODYTRACK, the accesses are equally divided up between instructions, shared read-only and shared read-write data. In RAYTRACE and VOLREND, most L2 cache accesses are reads to shared read-only data. However, the sharing degree of the cache lines that are read is variable. The sharing degree of cache lines is an important parameter when deciding the cluster size for replication. While cache lines with a high sharing degree can be shared by all neighboring cores, cache lines with a limited sharing degree can only be shared by a restricted number of cores.

Figure 6-2: Distribution of the accesses to the shared L2 cache as a function of the sharing degree for (a) BARNES and (b) BODYTRACK. The accesses are broken down into (1) Instruction, (2) Shared Read-Only (RO) Data, (3) Shared Read-Write (RW) Data Read and (4) Shared Read-Write (RW) Data Write.

Figure 6-3: Distribution of the accesses to the shared L2 cache as a function of the sharing degree for (a) RAYTRACE and (b) VOLREND. The accesses are broken down into (1) Instruction, (2) Shared Read-Only (RO) Data, (3) Shared Read-Write (RW) Data Read and (4) Shared Read-Write (RW) Data Write.

Increasing the number of replicas improves the data locality (i.e., L2 hit latency) but also increases the off-chip miss rate. Decreasing the number of replicas has the opposite effect. This thesis explores a mechanism to implement cluster-level replication in Section 6.4.

6.1.3 Proposed Idea
We propose a low-overhead yet highly accurate hardware-only predictive mechanism to track and classify the reuse of each cache line in the LLC. Our runtime classifier only allows replicating those cache lines that demonstrate reuse at the LLC while bypassing replication for others. When a cache line replica is evicted or invalidated, our classifier adapts by adjusting its future replication decision accordingly. This reuse tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional cache coherence protocols. The locality-aware protocol is advantageous because it:
1. Enables lower memory access latency and energy by selectively replicating cache lines that show high reuse in the LLC slice of the requesting core.
2. Better exploits the LLC by balancing the off-chip miss rate and on-chip locality using a classifier that adapts to the runtime reuse at the granularity of cache lines.
3. Allows coherence complexity almost identical to that of a traditional non-hierarchical (flat) coherence protocol since replicas are only allowed to be placed at the LLC slice of the requesting core. The additional coherence complexity only arises within a core when the LLC slice is searched on an L1 cache miss, or when a cache line in the core's local cache hierarchy is evicted/invalidated.

6.2 Locality-Aware LLC Data Replication
We describe how the locality-aware protocol works by implementing it on top of the baseline described in Chapter 2.

Figure 6-4: Mockup requests showing the locality-aware LLC replication protocol. The black data block has high reuse, so a local LLC replica is allowed and services requests from the depicted cores. The low-reuse red data block is not allowed to be replicated at the LLC, and a request that misses in the L1 must access the LLC slice at its home core. The home core for each data block can also service local private cache misses.

6.2.1 Protocol Operation
The four essential components of data replication are: (1) choosing which cache lines to replicate, (2) determining where to place a replica, (3) how to look up a replica, and (4) how to maintain coherence for replicas. We first define a few terms to facilitate describing our protocol.
1. Home Location: The core where all requests for a cache line are serialized for maintaining coherence.
2. Replica Sharer: A core that is granted a replica of a cache line in its LLC slice.
3. Non-Replica Sharer: A core that is NOT granted a replica of a cache line in its LLC slice.
4. Replica Reuse: The number of times an LLC replica is accessed before it is invalidated or evicted.
5. Home Reuse: The number of times a cache line is accessed at the LLC slice in its home location before a conflicting write or eviction.
6. Replication Threshold (RT): The reuse above or equal to which a replica is created.

Figure 6-5: Each directory entry is extended with replication mode bits to classify the usefulness of LLC replication. Each cache line is initialized to non-replica mode with respect to all cores. Based on the reuse counters (at the home as well as the replica location) and the parameter RT, the cores are transitioned between replica and non-replica modes. Here XReuse is (Replica + Home) Reuse on an invalidation and Replica Reuse on an eviction.

Note that for a cache line, one core can be a replica sharer while another can be a non-replica sharer. Our protocol starts out as a conventional directory protocol and initializes all cores as non-replica sharers of all cache lines (as shown by Initial in Figure 6-5). Let us understand the handling of read requests, write requests, evictions, invalidations and downgrades as well as cache replacement policies under this protocol.

Read Requests
On an L1 cache read miss, the core first looks up its local LLC slice for a replica. If a replica is found, the cache line is inserted at the private L1 cache. In addition, a Replica Reuse counter (as shown in Figure 6-6) at the LLC directory entry is incremented. The replica reuse counter is a saturating counter used to capture reuse information. It is initialized to '1' on replica creation and incremented on every replica hit.
On the other hand, if a replica is not found, the request is forwarded to the LLC home location. If the cache line is not found there, it is either brought in from the off-chip memory or the underlying coherence protocol takes the necessary actions to obtain the most recent copy of the cache line.

Figure 6-6: ACKwise_p-Complete locality classifier LLC tag entry. It contains the tag, LRU bits and directory entry. The directory entry contains the state, ACKwise_p pointers, a Replica reuse counter as well as Replication mode bits and Home reuse counters for every core in the system.

The directory entry is augmented with additional bits as shown in Figure 6-6. These bits include (a) a Replication Mode bit and (b) a Home Reuse saturating counter for each core in the system. Note that adding several bits for tracking the locality of each core in the system does not scale with the number of cores; therefore, we will present a cost-efficient classifier implementation in Section 6.2.2. The replication mode bit is used to identify whether a replica is allowed to be created for the particular core. The home reuse counter is used to track the number of times the cache line is accessed at the home location by the particular core. This counter is initialized to '0' and incremented on every hit at the LLC home location.

If the replication mode bit is set to true, the cache line is inserted in the requester's LLC slice and the private L1 cache. Otherwise, the home reuse counter is incremented. If this counter has reached the Replication Threshold (RT), the requesting core is "promoted" (the replication mode bit is set to true) and the cache line is inserted in its LLC slice and private L1 cache. If the home reuse counter is still less than RT, a replica is not created. The cache line is only inserted in the requester's private L1 cache. If the LLC home location is at the requesting core, the read request is handled directly at the LLC home. Even if the classifier directs to create a replica, the cache line is just inserted at the private L1 cache.

Write Requests
On an L1 cache write miss for an exclusive copy of a cache line, the protocol checks the local LLC slice for a replica. If a replica exists in the Modified (M) or Exclusive (E) state, the cache line is inserted at the private L1 cache. In addition, the Replica Reuse counter is incremented. If a replica is not found or exists in the Shared (S) state, the request is forwarded to the LLC home location. The directory invalidates all the LLC replicas and L1 cache copies of the cache line, thereby maintaining the single-writer multiple-reader invariant [86]. The acknowledgements received are processed as described in Section 6.2.1. After all such acknowledgements are processed, the Home Reuse counters of all non-replica sharers other than the writer are reset to '0'. This has to be done since these sharers have not shown enough reuse to be "promoted". If the writer is a non-replica sharer, its home reuse counter is modified as follows. If the writer is the only sharer (replica or non-replica), its home reuse counter is incremented, else it is reset to '1'. This enables the replication of migratory shared data at the writer, while avoiding it if the replica is likely to be downgraded due to conflicting requests by other cores.
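The per-core classification logic at the home directory can be summarized with the sketch below; it follows the promotion/demotion rules of Figure 6-5 and the request handling just described, but is simplified to a single core's entry, and the structure and function names are illustrative.

// Sketch of the per-core replication classifier kept in each LLC home
// directory entry (Figure 6-5). One entry per tracked core; names are
// illustrative, and saturation of the 2-bit counters is ignored for brevity.
#include <cstdint>

constexpr uint32_t RT = 3;   // Replication Threshold used in the evaluation

struct CoreLocalityEntry {
    bool     replica_mode = false;  // may this core create an LLC replica?
    uint32_t home_reuse   = 0;      // hits at the home location since last reset
};

// Called when the core's read request reaches the LLC home. Returns true if a
// replica should be created in the requester's local LLC slice.
bool OnHomeRead(CoreLocalityEntry& e) {
    if (e.replica_mode) return true;
    if (++e.home_reuse >= RT) {     // "promotion"
        e.replica_mode = true;
        return true;
    }
    return false;                   // serve the L1 only, no LLC replica
}

// Called when the core's LLC replica is invalidated (xreuse = replica + home
// reuse) or evicted (xreuse = replica reuse alone).
void OnReplicaInvalidatedOrEvicted(CoreLocalityEntry& e, uint32_t xreuse) {
    e.replica_mode = (xreuse >= RT);  // keep replica status or demote
    e.home_reuse   = 0;               // start the next round of classification
}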
Evictions and Invalidations
On an invalidation request, both the LLC slice and L1 cache on a core are probed and invalidated. If a valid cache line is found in either cache, an acknowledgement is sent to the LLC home location. In addition, if a valid LLC replica exists, the replica reuse counter is communicated back with the acknowledgement. The locality classifier uses this information along with the home reuse counter to determine whether the core stays as a replica sharer. If the (replica + home) reuse is ≥ RT, the core maintains replica status, else it is demoted to non-replica status (as shown in Figure 6-5). The two reuse counters have to be added since this is the total reuse that the core exhibited for the cache line between successive writes.

When an L1 cache line is evicted, the LLC replica location is probed for the same address. If a replica is found, the dirty data in the L1 cache line is merged with it, else an acknowledgement is sent to the LLC home location. However, when an LLC replica is evicted, the L1 cache is probed for the same address and invalidated. An acknowledgement message containing the replica reuse counter is sent back to the LLC home location. The replica reuse counter is used by the locality classifier as follows. If the replica reuse is ≥ RT, the core maintains replica status, else it is demoted to non-replica status. Only the replica reuse counter has to be used for this decision since it captures the reuse of the cache line at the LLC replica location. After the acknowledgement corresponding to an eviction or invalidation of the LLC replica is received at the home, the locality classifier sets the home reuse counter of the corresponding core to '0' for the next round of classification.

The eviction of an LLC replica back-invalidates the L1 cache (as described earlier). A possibly more optimal strategy is to maintain the validity of the L1 cache line. This requires two additional message types, one to communicate back the reuse counter on the LLC replica eviction and another to communicate the acknowledgement when the L1 cache line is finally invalidated or evicted. We opted for the back-invalidation for the following two reasons:
1. To maintain the simplicity of the coherence protocol.
2. The energy and performance improvements of the more optimal strategy are negligible. This is because: (a) the LLC size is > 4x the L1 cache size, thereby keeping the probability of evicted LLC lines having an L1 copy extremely low, and (b) the LLC replacement policy implemented prioritizes retaining cache lines that have L1 cache copies.

LLC Replacement Policy
Traditional LLC replacement policies use the least recently used (LRU) policy. One reason why this is sub-optimal is that the LRU information cannot be fully captured at the LLC because the L1 cache filters out a large fraction of accesses that hit within it. In order to be cognizant of this, the replacement policy should prioritize retaining
cache lines that have L1 cache sharers. Some proposals in literature accomplish this by sending periodic Temporal Locality Hint messages from the L1 cache to the LLC [44]. However, this incurs additional network traffic. Our replacement policy accomplishes the same using a much simpler scheme. It first selects cache lines with the least number of L1 cache copies and then chooses the least recently used among them. The number of L1 cache copies is readily available since the directory is integrated within the LLC tags ("in-cache" directory). This reduces back invalidations to a negligible amount and outperforms the LRU policy (cf. Section 6.6.2).

6.2.2 Limited Locality Classifier Optimization
The classifier described earlier, which keeps track of locality information for all the cores in the directory entry, is termed the Complete locality classifier. It has a storage overhead of 30% (calculated in Section 6.2.3) at 64 cores and over 5x at 1024 cores. In order to mitigate this overhead, we develop a classifier that maintains locality information for a limited number of cores and classifies the other cores as replica or non-replica sharers based on this information.

The locality information for each core consists of (1) the core ID, (2) the replication mode bit and (3) the home reuse counter. The classifier that maintains a list of this information for a limited number of cores (k) is termed the Limited_k classifier. Figure 6-7 shows the information that is tracked by this classifier.

Figure 6-7: ACKwise_p-Limited_k locality classifier LLC tag entry. It contains the tag, LRU bits and directory entry. The directory entry contains the state, ACKwise_p pointers, a Replica reuse counter as well as the Limited_k classifier. The Limited_k classifier contains a Replication mode bit and Home reuse counter for a limited number of cores. A majority vote of the modes of tracked cores is used to classify new cores as replicas or non-replicas.

The sharer list of the ACKwise limited directory entry cannot be reused for tracking locality information because of its different functionality. While the hardware pointers of ACKwise are used to maintain coherence, the limited locality list serves to classify cores as replica or non-replica sharers. Decoupling in this manner also enables the locality-aware protocol to be implemented efficiently on top of other scalable directory organizations.

We now describe the working of the limited locality classifier. At startup, all entries in the limited locality list are free and this is denoted by marking all core IDs as Invalid. When a core makes a request to the home location, the directory first checks if the core is already being tracked by the limited locality list. If so, the actions described previously are carried out. Else, the directory checks if a free entry exists. If it does exist, it allocates the entry to the core and the same actions are carried out. Otherwise, the directory checks if a currently tracked core can be replaced. An ideal candidate for replacement is a core that is currently not using the cache line. Such a core is termed an inactive sharer and should ideally relinquish its entry to a core in need of it. A replica core becomes inactive on an LLC invalidation or an eviction. A non-replica core becomes inactive on a write by another core. If such a replacement candidate exists, its entry is allocated to the requesting core. The initial replication mode of the core is obtained by taking a majority vote of the modes of the tracked cores. This is done so as to start off the requester in its most probable mode. Finally, if no replacement candidate exists, the mode for the requesting core is obtained by taking a majority vote of the modes of all the tracked cores. The limited locality list is left unchanged.
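A compact way to picture the Limited_k lookup path is sketched below; it follows the allocation, replacement and majority-vote rules just described, with k = 3 as in the evaluation. The container, field and method names are illustrative assumptions.

// Sketch of the Limited_k locality classifier lookup (k = 3). On a request
// from 'core', it returns the entry to use, allocating or stealing one
// according to the rules above; otherwise it falls back to a majority vote.
#include <array>
#include <cstdint>

struct LocalityEntry {
    static constexpr uint16_t kInvalid = 0xFFFF;
    uint16_t core_id      = kInvalid;
    bool     replica_mode = false;
    uint8_t  home_reuse   = 0;
    bool     inactive     = false;   // set on replica invalidation/eviction, or
                                     // on a conflicting write (non-replica core)
};

struct LimitedClassifier {
    std::array<LocalityEntry, 3> entries;   // Limited_3

    bool MajorityReplicaVote() const {
        int replicas = 0, tracked = 0;
        for (const auto& e : entries)
            if (e.core_id != LocalityEntry::kInvalid) { tracked++; replicas += e.replica_mode; }
        return tracked > 0 && 2 * replicas > tracked;
    }

    // Returns the entry tracking 'core', or nullptr if the core cannot be
    // tracked (in which case the caller uses MajorityReplicaVote()).
    LocalityEntry* Lookup(uint16_t core) {
        for (auto& e : entries)                       // already tracked?
            if (e.core_id == core) return &e;
        for (auto& e : entries)                       // free entry available?
            if (e.core_id == LocalityEntry::kInvalid) { e.core_id = core; return &e; }
        for (auto& e : entries)                       // replace an inactive sharer
            if (e.inactive) {
                bool mode = MajorityReplicaVote();    // start in most probable mode
                e = {};
                e.core_id      = core;
                e.replica_mode = mode;
                return &e;
            }
        return nullptr;                               // list left unchanged
    }
};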
The storage overhead for the Limited_k classifier is directly proportional to the number of cores (k) for which locality information is tracked. In Section 6.6.3, we evaluate the storage and accuracy tradeoffs for the Limited_k classifier. Based on our observations, we pick the Limited_3 classifier.

6.2.3 Overheads
Storage
The locality-aware protocol requires extra bits at the LLC tag arrays to track locality information. Each LLC directory entry requires 2 bits for the replica reuse counter (assuming an optimal RT of 3). The Limited_3 classifier tracks the locality information for three cores. Tracking one core requires 2 bits for the home reuse counter, 1 bit to store the replication mode and 6 bits to store the core ID (for a 64-core processor). Hence, the Limited_3 classifier requires an additional 27 (= 3 x 9) bits of storage per LLC directory entry. The Complete classifier, on the other hand, requires 192 (= 64 x 3) bits of storage.

All the following calculations are for one core but they are applicable for the entire processor since all the cores are identical. The sizes of the per-core L1 and LLC caches used in our system are shown in Table 2.1. With a 256 KB LLC slice and 64-byte (512-bit) cache lines, the storage overhead of the replica reuse counter is (2/512) x 256 KB = 1 KB. The storage overhead of the Limited_3 classifier is (27/512) x 256 KB = 13.5 KB. For the Complete classifier, it is (192/512) x 256 KB = 96 KB. Now, the storage overhead of the ACKwise_4 protocol in this processor is 12 KB (assuming 6 bits per ACKwise pointer) and that for a Full Map protocol is 32 KB. Adding up all the storage components, the Limited_3 classifier with the ACKwise_4 protocol uses slightly less storage than the Full Map protocol and 4.5% more storage than the baseline ACKwise_4 protocol. The Complete classifier with the ACKwise_4 protocol uses 30% more storage than the baseline ACKwise_4 protocol.

LLC Tag & Directory Accesses
Updating the replica reuse counter in the local LLC slice requires a read-modify-write operation on each replica hit. However, since the replica reuse counter (being 2 bits) is stored in the LLC tag array that needs to be written on each LLC lookup to update the LRU counters, our protocol does not add any additional tag accesses. At the home location, the lookup/update of the locality information is performed concurrently with the lookup/update of the sharer list for a cache line. However, the lookup/update of the directory is now more expensive since it includes both the sharer list and the locality information. This additional expense is accounted for in our evaluation.

Network Traffic
The locality-aware protocol communicates the replica reuse counter to the LLC home along with the acknowledgment for an invalidation or an eviction. This is accomplished without creating additional network flits. For a 48-bit physical address and 64-bit flit size, an invalidation message requires 42 bits for the physical cache line address, 12 bits for the sender and receiver core IDs and 2 bits for the replica reuse counter. The remaining 8 bits suffice for storing the message type.

6.3 Discussion
6.3.1 Replica Creation Strategy
In the protocol described earlier, replicas are created in all valid cache states. A simpler strategy is to create an LLC replica only in the Shared cache state. This enables instructions, shared read-only and shared read-write data that exhibit high read run-length to be replicated so as to serve multiple read requests from within the local LLC slice.
However, migratory shared data cannot be replicated with this simpler strategy because both read and write requests are made to it in an interleaved manner. Such data patterns can be efficiently handled only if the replica is created in the Exclusive or Modified state. Benchmarks that exhibit both the above access patterns are observed in our evaluation (cf. Section 6.6.1).

6.3.2 Coherence Complexity
The local LLC slice is always looked up on an L1 cache miss or eviction. Additionally, both the L1 cache and LLC slice are probed on every asynchronous coherence request (i.e., invalidate, downgrade, flush or write-back). This is needed because the directory only has a single pointer to track the local cache hierarchy of each core. This method also allows the coherence complexity to be similar to that of a non-hierarchical (flat) coherence protocol.

To avoid the latency and energy overhead of searching the LLC replica, one may want to optimize the handling of asynchronous requests, or decide intelligently whether to look up the local LLC slice on a cache miss or eviction. In order to enable such optimizations, additional sharer tracking bits are needed at the directory and L1 cache. Moreover, additional network message types are needed to relay coherence information between the LLC home and other actors. In order to evaluate whether this additional coherence complexity is worthwhile, we compared our protocol to a dynamic oracle that has perfect information about whether a cache line is present in the local LLC slice. The dynamic oracle avoids all unnecessary LLC lookups. The completion time and energy difference when compared to the dynamic oracle was less than 1%. Hence, in the interest of avoiding the additional complexity, the LLC replica is always looked up for the above coherence requests.

6.3.3 Classifier Organization
The classifier for the locality-aware protocol is organized using an in-cache structure, i.e., the replication mode bits and home reuse counters are maintained for all cache lines in the LLC. However, this is not an essential requirement. The classifier is logically decoupled from the directory and could be implemented using a sparse organization. The storage overhead for the in-cache organization is calculated in Section 6.2.3. The performance and energy overhead for this organization is small because: (1) the classifier lookup incurs a relatively small energy and latency penalty when compared to the data array lookup of the LLC slice and communication over the network (justified in our results), and (2) only a single tag lookup is needed for accessing the classifier and LLC data. In a sparse organization, a separate lookup is required for the classifier and the LLC data. Even though these lookups could be performed in parallel with no latency overhead, the energy expended to look up two CAM structures needs to be paid.

6.4 Cluster-Level Replication
In the locality-aware protocol, the location where a replica is placed is always the LLC slice of the requesting core. An additional method by which one could explore the trade-off between LLC hit latency and LLC miss rate is by replicating at a cluster level. A cluster is defined as a group of neighboring cores where there is at most one replica for a cache line. Each such replica would service the misses of all the L1 caches in the same cluster. Increasing the size of a cluster would increase LLC hit latency and decrease LLC miss rate, and decreasing the cluster size would have the opposite effect.
The optimal replication algorithm would optimize the cluster size so as to maximize the performance and energy benefit. We explored the benefits of clustering under our protocol after making the appropriate changes. The changes include the following:
1. Blocking at the replica location (the core in the cluster where a replica could be found) before forwarding the request to the home location, so that multiple cores in the same cluster do not have outstanding requests to the LLC home location.
2. Additional coherence messages for requests and replies between the L1 cache & LLC replica location and between the LLC replica & LLC home.
3. Additional storage at the directory for differentiating between LLC replicas and L1 caches when tracking sharers.
4. Additional storage at the L1 cache tag to determine whether an L1 cache copy is backed by the LLC replica or the LLC home.
5. Hierarchical invalidation and downgrade of the replica and the L1 caches that it tracks.
6. An additional coherence message to dequeue the request at the LLC replica location in case the LLC home decides not to replicate (in the LLC) but responds directly to the L1 cache.

In addition to these changes, two significant observations were made related to the ACKwise limited directory protocol.
1. An imprecise tracking of sharers (e.g., ACKwise) can only be done at the LLC home and not at the LLC replica location. At the LLC replica location, a precise tracking is needed (e.g., using a full-map directory protocol). This is done to ensure protocol correctness without additional on-chip network support to stop invalidations from crossing cluster boundaries.
2. While performing broadcast invalidations from the LLC home, only the sharers (i.e., L1 caches and LLC replicas that are tracked from the LLC home) should respond with an acknowledgement to the LLC home. L1 caches that are backed by an LLC replica should wait for a matching invalidation before responding. This is done so as to ensure that invalidation requests take the same path as replies, from the LLC home → LLC replica → L1 cache. Else, an invalidation could arrive early, i.e., before the reply, leading to a protocol deadlock.

Overall, cluster-level replication was not found to be beneficial in the evaluated 64-core system, for the following reasons (see Section 6.6.4 for details).
1. Using clustering increases network serialization delays since multiple locations now need to be searched/invalidated on an L1 cache miss.
2. Cache lines with a low degree of sharing do not benefit because clustering just increases the LLC hit latency without reducing the LLC miss rate.
3. The added coherence complexity of clustering increased our design and verification time significantly.

6.5 Evaluation Methodology
We evaluate a 64-core multicore using in-order cores. The default architectural parameters used for evaluation are shown in Table 2.1. The parameters specific to the locality-aware LLC (/L2) replication scheme are shown in Table 6.1.

Architectural Parameter        Value
Replication Threshold          RT = 3
Classifier                     Limited_3
Table 6.1: Locality-aware LLC (/L2) replication parameters.

6.5.1 Baseline LLC Management Schemes
We model four baseline multicore systems that assume private L1 caches managed using the ACKwise_4 protocol.
1. The Static-NUCA baseline address interleaves all cache lines among the LLC slices.
2. The Reactive-NUCA [39] baseline places private data at the requester's LLC slice, replicates instructions in one LLC slice per cluster of 4 cores using rotational interleaving, and address interleaves shared data in a single LLC slice.
The Reactive-NUCA [39] baseline places private data at the requester's LLC slice, replicates instructions in one LLC slice per cluster of 4 cores using rotational interleaving, and address interleaves shared data in a single LLC slice. 3. The Victim Replication (VR) [971 baseline uses the requester's local LLC slice as a victim cache for data that is evicted from the Li cache. The evicted victims are placed in the local LLC slice only if a line is found that is either invalid, a replica itself or has no sharers in the LI cache. 4. The Adaptive Selective Replication (ASR) [101 baseline also replicates cache lines in the requester's local LLC slice on an LI eviction. However, it only allows LLC replication for cache lines that are classified as shared readonly. ASR pays attention to the LLC pressure by basing its replication decision on per-core hardware monitoring circuits that quantify the replication effectiveness based on the benefit (lower LLC hit latency) and cost (higher LLC miss 137 latency) of replication. We do not model the hardware monitoring circuits or the dynamic adaptation of replication levels. Instead, we run ASR at five different replication levels (0, 0.25, 0.5, 0.75, 1) and choose the one with the lowest energy-delay product for each benchmark. 6.5.2 Evaluation Metrics Each multithreaded benchmark is run to completion using the input sets from Table 2.3. We measure the energy consumption of the memory system including the on-chip caches, DRAM and the network. We also measure the completion time, i.e., the time in the parallel region of the benchmark. This includes the compute latency, the memory access latency, and the synchronization latency. The memory access latency is further broken down into: 1. Li to LLC replica latency is the time spent by the Li cache miss request to the LLC replica location and the corresponding reply from the LLC replica including time spent accessing the LLC. 2. Li to LLC home latency is the time spent by the Li cache miss request to the LLC home location and the corresponding reply from the LLC home including time spent in the network and first access to the LLC. 3. LLC home waiting time is the queueing delay at the LLC home incurred because requests to the same cache line must be serialized to ensure memory consistency. 4. LLC home to sharers latency is the round-trip time needed to invalidate sharers and receive their acknowledgments. This also includes time spent requesting and receiving synchronous write-backs. 5. LLC home to off-chip memory latency is the time spent accessing memory including the time spent communicating with the memory controller and the queueing delay incurred due to finite off-chip bandwidth. 138 One of the important memory system metrics we track to evaluate our protocol are the various cache miss types. They are as follows: 1. LLC replica hits are Li cache misses that hit at the LLC replica location. 2. LLC home hits are Li cache misses that hit at the LLC home location when routed directly to it or LLC replica misses that hit at the LLC home location. 3. Off-chip misses are Li cache misses that are sent to DRAM because the cache line is not present on-chip. 6.6 6.6.1 Results Comparison of Replication Schemes Figures 6-8 and 6-9 plot the energy and completion time breakdown for the replication schemes evaluated. The RT-1, RT-3 and RT-8 bars correspond to the locality-aware scheme with replication thresholds of 1, 3 and 8 respectively. 
6.6 Results

6.6.1 Comparison of Replication Schemes

Figures 6-8 and 6-9 plot the energy and completion time breakdowns for the replication schemes evaluated. The RT-1, RT-3 and RT-8 bars correspond to the locality-aware scheme with replication thresholds of 1, 3 and 8 respectively.

The energy and completion time trends can be understood based on the following three factors: (1) the type of data accessed at the LLC (instructions, private data, shared read-only data and shared read-write data), (2) the reuse run-length at the LLC, and (3) the working set size of the benchmark. Figure 6-10, which plots how L1 cache misses are handled by the LLC, is also instrumental in understanding these trends.

Figure 6-8: Energy breakdown for the LLC replication schemes evaluated. Results are normalized to that of S-NUCA. Note that Average and not Geometric-Mean is plotted here.

Figure 6-9: Completion Time breakdown for the LLC replication schemes evaluated. Results are normalized to that of S-NUCA. Note that Average and not Geometric-Mean is plotted here.

Many benchmarks (e.g., BARNES) have a working set that fits within the LLC even if replication is done on every L1 cache miss. Hence, all locality-aware schemes (RT-1, RT-3 and RT-8) perform well both in energy and performance. In our experiments, we observe that BARNES exhibits a high reuse of cache lines at the LLC through accesses directed at shared read-write data. S-NUCA, R-NUCA and ASR do not replicate shared read-write data and hence do not observe any benefits with BARNES. VR observes some benefits since it locally replicates read-write data. However, it exhibits higher energy and completion time than the locality-aware protocol for the following two reasons. (1) Its (almost) blind process of creating replicas on all evictions results in the pollution of the LLC, leaving less space for useful replicas and LLC home lines. This is evident from the lower replica hit rate for VR when compared to our locality-aware protocol. (2) The exclusive relationship between the L1 cache and the local LLC slice in VR causes a line to always be written back on an eviction even if the line is clean. This is because a replica hit always causes the line in the LLC slice to be invalidated and inserted into the L1 cache. Hence, in the common case where replication is useful, each hit at the LLC location effectively incurs both a read and a write at the LLC, and a write expends 1.2x more energy than a read.

Similar trends in VR performance and energy exist in the WATER-NSQ, PATRICIA, BODYTRACK, FACESIM, STREAMCLUSTER and BLACKSCHOLES benchmarks. BODYTRACK and FACESIM are similar to BARNES except that their LLC accesses have a greater fraction of instructions and/or shared read-only data. The accesses to shared read-write data are again mostly reads with only a few writes. R-NUCA shows significant benefits since it replicates instructions.
ASR shows even higher energy and performance benefits since it replicates both instructions and shared read-only data. The locality-aware protocol also shows the same benefits since it replicates all classes of cache lines, provided they exhibit reuse in accordance with the replication thresholds. VR shows higher LLC energy for the same reasons as in BARNES. ASR and our locality-aware protocol allow the LLC slice at the replica location to be inclusive of the L1 cache, and hence do not have the same drawback as VR. VR, however, does not have a performance overhead because the evictions are not on the critical path of the processor pipeline.

Figure 6-10: L1 Cache Miss Type breakdown for the LLC replication schemes evaluated.

Note that BODYTRACK, FACESIM and RAYTRACE are the only three among the evaluated benchmarks that have a significant L1-I cache MPKI (misses per thousand instructions). All other benchmarks have an extremely low L1-I MPKI (< 0.5), and hence R-NUCA's replication mechanism is not effective in most cases. Even in the above 3 benchmarks, R-NUCA does not place instructions in the local LLC slice but replicates them at a cluster level; hence, the serialization delays to transfer the cache lines over the network still need to be paid.

BLACKSCHOLES, on the other hand, exhibits a large number of LLC accesses to private data and a small number to shared read-only data. Since R-NUCA places private data in its local LLC slice, it obtains performance and energy improvements over S-NUCA. However, the improvements obtained are limited since false sharing is exhibited at the page level, i.e., multiple cores privately access non-overlapping cache lines in a page. Since R-NUCA's classification mechanism operates at a page level, it is not able to locally place all truly private lines. The locality-aware protocol obtains improvements over R-NUCA by replicating these cache lines. ASR only replicates shared read-only cache lines and identifies these lines by using a per-cache-line sticky Shared bit. Hence, ASR follows the same trends as S-NUCA.

DEDUP almost exclusively accesses private data (without any false sharing) and hence performs optimally with R-NUCA. Benchmarks such as RADIX, FFT, LU-C, OCEAN-C, FLUIDANIMATE and CONCOMP do not benefit from replication, and hence the baseline R-NUCA performs optimally. R-NUCA does better than S-NUCA because these benchmarks have significant accesses to thread-private data. ASR, being built on top of S-NUCA, shows the same trends as S-NUCA. VR, on the other hand, shows higher LLC energy for the same reasons outlined earlier. VR's replication of private data in its local LLC slice is also not as effective as R-NUCA's policy of placing private data locally, especially in OCEAN-C, FLUIDANIMATE and CONCOMP, whose working sets do not fit in the LLC. The locality-aware protocol benefits from the optimizations in R-NUCA and tracks its performance and energy consumption.
For the locality-aware protocol, an RT of 3 dominates an RT of 1 in FLUIDANIMATE because it demonstrates significant off-chip miss rates (as evident from its energy and completion time breakdowns); hence, it is essential to balance on-chip locality against off-chip miss rate to achieve the best energy consumption and performance. While an RT of 1 replicates on every L1 cache miss, an RT of 3 replicates only if a reuse of at least 3 is demonstrated. Using an RT of 3 reduces the off-chip miss rate in FLUIDANIMATE and provides the best performance and energy consumption. Using an RT of 3 also provides the maximum benefit in benchmarks such as OCEAN-C and OCEAN-NC.

As RT increases, the off-chip miss rate decreases but the LLC hit latency increases. For example, with an RT of 8, STREAMCLUSTER shows an increased completion time and network energy caused by repeated fetches of the cache line over the network. An RT of 3 brings the cache line into the local LLC slice sooner, avoiding the unnecessary network traffic and its performance and energy impact. This is evident from the smaller "L1-To-LLC-Home" component of the completion time breakdown graph and the higher number of replica hits when using an RT of 3. We explored all values of RT between 1 & 8 and found that they provide no additional insight beyond the data points discussed here.

LU-NC exhibits migratory shared data. Such data exhibits exclusive use (both read and write accesses) by a unique core over a period of time before being handed to its next accessor. Replication of migratory shared data requires creation of a replica in an Exclusive coherence state. The locality-aware protocol makes LLC replicas for such data when sufficient reuse is detected. Since ASR does not replicate shared read-write data, it cannot show benefit for benchmarks with migratory shared data. VR, on the other hand, (almost) blindly replicates on all L1 evictions and performs on par with the locality-aware protocol for LU-NC.

To summarize, the locality-aware protocol provides better energy consumption and performance than the other LLC data management schemes. It is important to balance the on-chip data locality and off-chip miss rate; overall, an RT of 3 achieves the best trade-off. It is also important to replicate all types of data; the selective replication of only certain types of data by R-NUCA (instructions) and ASR (instructions, shared read-only data) leads to sub-optimal energy and performance. Overall, the locality-aware protocol has a 16%, 14%, 13% and 21% lower energy and a 4%, 9%, 6% and 13% lower completion time compared to VR, ASR, R-NUCA and S-NUCA respectively.

6.6.2 LLC Replacement Policy

As discussed earlier in Section 6.2.1, we propose to use a modified-LRU replacement policy for the LLC. It first selects cache lines with the least number of sharers and then chooses the least recently used among them. This replacement policy improves energy consumption over the traditional LRU policy by 15% and 5%, and lowers completion time by 5% and 2%, in the BLACKSCHOLES and FACESIM benchmarks respectively. In all other benchmarks this replacement policy tracks the LRU policy.
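To make the victim-selection rule concrete, the sketch below shows one way the modified-LRU policy described above could be implemented for a single LLC set. The metadata fields and the function name are illustrative assumptions, not the thesis's hardware implementation.

```cpp
#include <cstdint>
#include <vector>

// Minimal per-line metadata assumed for this sketch: a sharer count
// (from the directory entry) and an LRU timestamp.
struct CacheLineMeta {
    bool     valid;
    uint32_t numSharers;   // L1 caches / replicas currently tracked as sharers
    uint64_t lruTimestamp; // smaller value => less recently used
};

// Modified-LRU victim selection (Section 6.6.2): among the valid lines in a
// set, prefer the lines with the fewest sharers, and break ties by evicting
// the least recently used among them. Invalid lines are used outright.
int pickVictim(const std::vector<CacheLineMeta>& set) {
    int victim = -1;
    for (int way = 0; way < static_cast<int>(set.size()); ++way) {
        const CacheLineMeta& line = set[way];
        if (!line.valid) return way;                  // free way: use it directly
        if (victim < 0 ||
            line.numSharers < set[victim].numSharers ||
            (line.numSharers == set[victim].numSharers &&
             line.lruTimestamp < set[victim].lruTimestamp)) {
            victim = way;
        }
    }
    return victim;
}
```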
6.6.3 Limited Locality Classifier

Figure 6-11: Energy and Completion Time for the Limitedk classifier as a function of the number of tracked sharers (k). The results are normalized to that of the Complete (= Limited64) classifier.

Figure 6-11 plots the energy and completion time of the benchmarks with the Limitedk classifier when k is varied as (1, 3, 5, 7, 64). k = 64 corresponds to the Complete classifier. The results are normalized to those of the Complete classifier. The benchmarks that are not shown are identical to DEDUP, i.e., the completion time and energy stay constant as k varies. The experiments are run with the best RT value of 3 obtained in Section 6.6.1.

We observe that the completion time and energy of the Limited3 classifier never exceed those of the Complete classifier by more than 2%, except for STREAMCLUSTER. With STREAMCLUSTER, the Limited3 classifier starts off new sharers incorrectly in non-replica mode because of the limited number of cores available for taking the majority vote. This results in increased communication between the L1 cache and the LLC home location, leading to higher completion time and network energy. The Limited5 classifier, however, performs as well as the Complete classifier, but incurs an additional 9 KB storage overhead per core when compared to the Limited3 classifier. From the previous section, we observe that the Limited3 classifier performs better than all the other baselines for STREAMCLUSTER. Hence, to trade off the storage overhead of our classifier against the energy and completion time improvements, we chose k = 3 as the default for the limited classifier.

The Limited1 classifier is more unstable than the other classifiers. While it performs better than the Complete classifier for STREAMCLUSTER and LU-NC, it performs worse for the BARNES benchmark. The better energy consumption in LU-NC is due to the fact that the Limited1 classifier starts off new sharers in replica mode as soon as the first sharer acquires replica status. On the other hand, the Complete classifier has to learn the mode independently for each sharer, leading to a longer training period.

6.6.4 Cluster Size Sensitivity Analysis

Figure 6-12 plots the energy and completion time for the locality-aware protocol when run using different cluster sizes. The experiment is run with the optimal RT of 3. Using a cluster size of 1 proved to be optimal. This is due to several reasons. In benchmarks such as BARNES, STREAMCLUSTER and BODYTRACK, where the working set fits within the LLC even with replication, moving from a cluster size of 1 to 64 reduced data locality without improving the LLC miss rate, thereby hurting energy and performance. Benchmarks like RAYTRACE that contain a significant amount of read-only data with low degrees of sharing also do not benefit, since employing a cluster-based approach reduces data locality without improving the LLC miss rate. Employing a clustered replication policy leads to an equal probability of placing a cache line in the LLC slice of any core within the cluster. A cluster-based approach can be useful to explore the trade-off between LLC data locality and miss rate only if the data is shared by (almost) all cores within a cluster.

Figure 6-12: Energy and Completion Time at cluster sizes of 1, 4, 16 and 64 with the locality-aware data replication protocol. A cluster size of 64 is the same as R-NUCA except that it does not even replicate instructions.
In benchmarks such as RADIX and FLUIDANIMATE that show no benefit from replication, the locality-aware protocol bypasses all replication mechanisms; hence, employing higher cluster sizes would not be any more useful than employing a lower cluster size. Intelligently deciding which cache lines to replicate using an RT of 3 was enough to prevent any overheads of replication. The above reasons, along with the added coherence complexity of clustering (as discussed in Section 6.4), motivate using a cluster size of 1, at least in the 64-core multicore target that we evaluate.

6.7 Summary

This chapter proposed an intelligent locality-aware data replication scheme for the last-level cache. The locality is profiled at runtime using a low-overhead yet highly accurate in-hardware cache-line-level classifier. On a set of parallel benchmarks, the locality-aware protocol reduces the overall energy by 16%, 14%, 13% and 21% and the completion time by 4%, 9%, 6% and 13% when compared to the previously proposed Victim Replication, Adaptive Selective Replication, Reactive-NUCA and Static-NUCA LLC management schemes. The coherence complexity of the described protocol is almost identical to that of a traditional non-hierarchical (flat) coherence protocol since replicas are only allowed to be created at the LLC slice of the requesting core. The classifier is implemented with a 14.5 KB storage overhead per 256 KB LLC slice.

Chapter 7
Locality-Aware Cache Hierarchy Replication

This chapter combines the private cache (i.e., L1) and LLC (i.e., L2) replication protocols discussed in Chapters 4 & 6 into a combined cache hierarchy replication protocol. The combined protocol exploits the advantages of both the private cache & LLC replication protocols synergistically.

7.1 Motivation

The design of the protocol is motivated by the experimental observation that both L1 & L2 cache replication enable variable performance improvements across benchmarks. Certain benchmarks like TSP & DFS exhibit improvement only with locality-aware L1 replication, others like BARNES & RAYTRACE exhibit improvement only with locality-aware L2 replication, while benchmarks like BLACKSCHOLES & FACESIM exhibit improvement with both locality-aware L1 & L2 replication. This necessitates the design of a combined L1 & L2 replication protocol that obtains the benefits of both L1 & L2 replication. The combined protocol should be able to provide at least the following 3 data access modes:

1. Replicate the line in the L1 cache and access it at the L2 home location (this is the default in both locality-aware L1 & L2 replication).
2. Replicate the line in both the L1 & L2 caches, to leverage the benefits of locality-aware L2 replication.
3. Do not replicate the line in the L1 cache and access it remotely at the L2 home location (using word-level operations), to leverage the benefits of locality-aware L1 replication.

In addition to the above 3 modes, a 4th mode can be supported for increased efficiency.

4. Do not replicate the line in the L1 cache; replicate the line in the L2 cache and access it remotely at the L2 replica location (using word-level operations).

The 4th mode is useful for applications where cache lines do not exhibit much reuse at the L1 cache (since the per-thread working set does not fit within the L1) but exhibit significant reuse at the L2 cache.
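For concreteness, the fragment below shows one way the four access modes could be encoded from the per-line L1 and L2 classifier decisions described in Section 7.2. The enum and function names are illustrative assumptions rather than the hardware's actual encoding.

```cpp
// Per-line replication decision produced by a classifier (Private = replicate,
// Remote = access remotely with word-level operations).
enum class ClassifierMode { Private, Remote };

// The four data access modes of the combined L1 + L2 replication protocol.
enum class AccessMode {
    L1ReplicaL2Home,     // mode 1: replicate in L1, access at the L2 home
    L1ReplicaL2Replica,  // mode 2: replicate in both the L1 and L2 caches
    RemoteAtL2Home,      // mode 3: no replication, remote word access at the L2 home
    RemoteAtL2Replica    // mode 4: replicate in L2 only, remote word access there
};

// A line's access mode follows directly from the two classifier decisions.
AccessMode accessMode(ClassifierMode l1, ClassifierMode l2) {
    if (l1 == ClassifierMode::Private)
        return (l2 == ClassifierMode::Private) ? AccessMode::L1ReplicaL2Replica
                                               : AccessMode::L1ReplicaL2Home;
    return (l2 == ClassifierMode::Private) ? AccessMode::RemoteAtL2Replica
                                           : AccessMode::RemoteAtL2Home;
}
```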
Figure 7-1: The four data access modes. The red block prefers the 1st mode, being replicated only at the L1-I/L1-D cache and accessed directly at the L1. The blue block prefers the 2nd mode, being replicated at both the L1 & L2 caches. The violet block prefers the 3rd mode, and is accessed using remote-word requests at the L2 home location without being replicated in either the L1 or the L2 cache. The green block prefers the 4th mode, being replicated at the L2 cache and accessed using remote-word requests at the L2 replica location.

These 4 modes of data access are depicted in Figure 7-1. The red block prefers to be in the 1st mode, being replicated only at the L1 cache. The blue block prefers to be in the 2nd mode, being replicated at both the L1 & L2 caches. The violet block prefers to be in the 3rd mode, and is accessed remotely at the L2 home location without being replicated in either the L1 or the L2 cache. And finally, the green block prefers the 4th mode, being replicated at the L2 cache and accessed using word accesses at the L2 replica location.

7.2 Implementation

The combined replication protocol starts out as a private-L1 shared-L2 cache hierarchy. If a cache line shows less or more reuse at the L1 or L2 cache, its classification is adapted accordingly. We first describe the hardware modifications that need to be implemented for the functionality of the protocol and later walk through how memory requests are handled. This includes details about how the protocol transitions between the multiple data access modes shown in Figure 7-1.

7.2.1 Microarchitecture Modifications

Since both locality-aware L1 & L2 replication need to be implemented, each cache line tag at the L2 home location should contain both the L1-Classifier and the L2-Classifier, as shown in Figure 7-2. The L1-Classifier manages replication in the L1 cache while the L2-Classifier manages replication in the L2 cache. Both the L1-Classifier and L2-Classifier are looked up on an access to the L2 home and together decide whether the cache line should be replicated at the L1 or L2, or not replicated at all. The L1 & L2 classifiers each contain Mode, Home Reuse, and RAT-Level fields that serve to remember the past locality information for each cache line. (Note that the RAT-Level field was only used for private cache (/L1) replication as introduced in Chapter 4 and was not employed by the LLC (/L2) replication mechanism. The use of the RAT-Level in the combined mechanism will be explained later.)

Figure 7-2: Modifications to the L2 cache line tag (Tag, LRU, State, ACKwise Pointers (1..p), Limited Locality List (1..k)). Each cache line is augmented with a Private Reuse counter that tracks the number of times a cache line has been accessed at the L2 replica location. In addition, each cache line tag has classifiers for deciding whether or not to replicate lines in the L1 cache & L2 cache. Both the L1 & L2 classifiers contain the Mode, Home Reuse, and RAT-Level fields that serve to remember the past locality information for each cache line. The above information is only maintained for a limited number of cores, k, and the mode of untracked cores is obtained by a majority vote.
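The sketch below captures the tag extensions of Figure 7-2 as a plain data structure, including the majority-vote fallback for cores that are not among the k tracked entries. Field widths, names and the tie-breaking behavior of the vote are illustrative assumptions; the overall per-tag bit budget is given in Section 7.2.4.

```cpp
#include <array>
#include <cstdint>

constexpr int kTrackedCores = 3;  // Limited3 classifier

// Per-core classifier entry kept in an L2 tag (one set of these per classifier).
struct ClassifierEntry {
    uint16_t coreId;
    bool     privateMode;  // true: replicate for this core; false: remote access
    uint8_t  homeReuse;    // reuse observed while the core is in remote mode
    uint8_t  ratLevel;     // current remote-access-threshold level
    bool     valid;
};

struct LimitedClassifier {
    std::array<ClassifierEntry, kTrackedCores> tracked{};

    // Mode for a core: use its tracked entry if present, otherwise take a
    // majority vote over the valid tracked entries (ties default to private
    // here, which is an assumption of this sketch).
    bool privateModeFor(uint16_t coreId) const {
        int privVotes = 0, votes = 0;
        for (const ClassifierEntry& e : tracked) {
            if (!e.valid) continue;
            if (e.coreId == coreId) return e.privateMode;
            ++votes;
            if (e.privateMode) ++privVotes;
        }
        return votes == 0 || 2 * privVotes >= votes;
    }
};

// L2 cache tag extensions (Figure 7-2); ACKwise pointers and LRU/state elided.
struct L2TagExtension {
    LimitedClassifier l1Classifier;  // governs replication in the L1 caches
    LimitedClassifier l2Classifier;  // governs replication in the L2 (replica) slices
    uint8_t           privateReuse;  // reuse counter used at the L2 replica location
};
```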
On the other hand, a classifier at the L2 replica location should contain just the L1-Classifier, to determine whether the cache line should be replicated at the L1 or not. In addition, the cache tag at the L2 replica location should also contain a Private Reuse counter that tracks the number of times the cache line at the L2 replica location has been reused. This counter is communicated back to the L2 home on an invalidation/eviction to determine future replication decisions. To maintain hardware simplicity, each L2 cache tag holds all the above-described fields (i.e., the L1-Classifier, L2-Classifier and the Private Reuse counter). The L2-Classifier is not used at the L2 replica location and the Private Reuse counter is not used at the L2 home location. Together, these hardware structures implement the 4 modes of data access explained in Section 7.1.

7.2.2 Protocol Operation

We consider read requests first, write requests next, and finally evictions & invalidations.

Read Requests:

At L1 Cache: On a read request (which includes an instruction cache access), the L1 cache is looked up first. If the data is present, it is returned to the compute pipeline and the private reuse counter at the L1 cache is incremented. The private reuse counter tracks the number of times a cache line has been reused at the L1 cache and serves to make future decisions about whether the cache line should be replicated at the L1 cache. It is communicated back on an invalidation or eviction.

At L2 Replica: If the data is not present in the L1 cache, the request is sent to the L2 replica location. If the data is present at the L2 replica location, the L1-Classifier is looked up to get the mode of the cache line. If the mode is private, a read-only copy of the cache line is transferred back to the requesting core. The line is inserted into the L1 cache with the private reuse counter initialized to '1'. If the mode is remote, the remote reuse counter (of the L1-Classifier) at the L2 replica location is incremented to indicate that the cache line was reused at its remote location. If the remote reuse is ≥ PCT, the sharer is "promoted" to private mode and a copy of the cache line is handed to it. Else, the requested word is returned to the core. Finally, the private reuse counter at the L2 replica is also incremented to indicate that the cache line has been reused at the L2 replica location.
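A sketch of the replica-side read handling just described, with the PCT-based promotion check, follows. The struct, the response encoding (full line vs. single word) and the counter widths are assumptions made for readability, not the thesis's exact hardware interface.

```cpp
#include <cstdint>

constexpr uint8_t PCT = 4;  // private caching threshold (Table 7.1)

// Minimal view of the requesting core's L1-classifier entry at the L2 replica.
struct L1ClassifierEntry {
    bool    privateMode;
    uint8_t remoteReuse;
};

enum class ReadResponse { CacheLineCopy, WordOnly };

// Read request arriving at the L2 replica location (Section 7.2.2).
ReadResponse handleReadAtL2Replica(L1ClassifierEntry& entry,
                                   uint8_t& replicaPrivateReuse) {
    replicaPrivateReuse++;                    // the line was reused at the L2 replica
    if (entry.privateMode)
        return ReadResponse::CacheLineCopy;   // hand a read-only copy to the L1
    entry.remoteReuse++;                      // remote reuse observed at the replica
    if (entry.remoteReuse >= PCT) {
        entry.privateMode = true;             // "promote" the sharer to private mode
        return ReadResponse::CacheLineCopy;
    }
    return ReadResponse::WordOnly;            // return just the requested word
}
```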
At L2 Home: If the data is not present at the L2 replica location, the request is forwarded to the L2 home location. If the data is present at the L2 home location, the directory (co-located with the L2 home) obtains the most recent copy of the cache line from the sharers. Then, both the L1-Classifier and the L2-Classifier are looked up to get the mode of the cache line. If the L2-Classifier indicates that the mode is Private, the line is sent to the L2 replica location, and the mode provided by the L1-Classifier at the L2 home is used to initialize the mode in the L1-Classifier at the L2 replica location. On the other hand, if the L2-Classifier returns a Remote mode, the remote reuse counter in the L2-Classifier is incremented. The messages sent out then depend on the mode returned by the L1-Classifier. If the L1-Classifier indicates that the mode is Private, the cache line is sent to the L1 cache. But if the L1-Classifier also returns a Remote mode, the requested word is directly sent to the core. The remote reuse counter of the L1-Classifier at the L2 home location is also incremented.

At DRAM Controller: If the data is not present at the L2 home location, the request is forwarded to the DRAM controller. Once the cache line is returned from DRAM, the L1-Classifier and L2-Classifier are initialized such that the cache line follows the default private-L1 shared-L2 cache hierarchy.

Write Requests:

At L1 Cache: On a write request, the L1 cache is looked up first. If the data is present, the word (i.e., the write data) is directly written to the L1 cache. The private reuse counter at the L1 cache is incremented to indicate that the cache line has been reused.

At L2 Replica: If the data is not present in the L1 cache, the request is sent to the L2 replica location. If the data is present at the L2 replica location, the L1-Classifier is looked up to get the mode of the cache line. If the mode is Private, the line is transferred back to the requesting core. The line is inserted into the L1 cache with the private reuse counter initialized to '1'. If the mode is Remote, the remote reuse counter (of the L1-Classifier) at the L2 replica location is incremented to indicate that the cache line was reused at its remote location. If the remote reuse is ≥ PCT, the sharer is "promoted" to private mode and a read-write copy of the cache line is handed to it. Else, the word is directly written to the L2 cache. The private reuse counter at the L2 replica is also incremented to indicate that the cache line has been reused at the L2 replica location.

At L2 Home: If the data is not present at the L2 replica location, the request is forwarded to the L2 home location. If the data is present at the L2 home location, the directory (co-located with the L2 home) performs the following actions: (1) it invalidates all the private sharers of the cache line, and (2) it sets the remote reuse counters of all its remote sharers to '0'. The L1-Classifier and the L2-Classifier are then looked up to get the mode of the cache line. If the L2-Classifier indicates that the mode is Private, a private read-write copy of the line is sent to the L2 replica location. The mode provided by the L1-Classifier at the L2 home is used to initialize the mode in the L1-Classifier at the L2 replica location. On the other hand, if the L2-Classifier returns a Remote mode, the remote reuse counter in the L2-Classifier is incremented. The response sent out now depends on the mode returned by the L1-Classifier. If the L1-Classifier indicates that the mode is Private, a private read-write copy of the cache line is sent to the L1 cache. But if the L1-Classifier also returns a Remote mode, the word is directly written to the L2 cache. The remote reuse counter of the L1-Classifier at the L2 home location is also incremented.

At DRAM Controller: If the data is not present at the L2 home location, the request is forwarded to the DRAM controller. Once the cache line is returned from DRAM, the L1-Classifier is initialized to Private mode while the L2-Classifier is initialized to Remote mode. This allows the cache line to be replicated in the L1 cache but not in the L2 cache by default.
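The directory actions taken for a write at the L2 home can be summarized by the small routine below; the sharer bookkeeping structure is an assumption of this sketch, and sending the actual invalidation messages and collecting acknowledgements is elided.

```cpp
#include <cstdint>
#include <vector>

// One tracked sharer of a cache line at the L2 home (illustrative layout).
struct SharerEntry {
    bool    valid;
    bool    privateMode;   // true: holds a private copy; false: remote sharer
    uint8_t remoteReuse;   // reuse counted at the home while in remote mode
};

// On a write reaching the L2 home (Section 7.2.2), the co-located directory
// (1) invalidates every private sharer of the line and (2) resets the remote
// reuse counters of its remote sharers. Whether an invalidated sharer stays
// in private mode is decided later, when its invalidation response (carrying
// the private reuse count) arrives. Returns the number of invalidations sent.
int handleWriteAtL2Home(std::vector<SharerEntry>& sharers) {
    int invalidations = 0;
    for (SharerEntry& s : sharers) {
        if (!s.valid) continue;
        if (s.privateMode) ++invalidations;  // private copy must be invalidated
        else s.remoteReuse = 0;              // reset remote sharers' reuse counters
    }
    return invalidations;
}
```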
Evictions and Invalidations: When a cache line is removed from the private L1 cache or the L2 replica location due to an eviction (conflict or capacity miss) or an invalidation (exclusive request by another core), the private reuse counter is communicated to its backing location.

From L1 Cache: On an invalidation response from the L1 cache to the L2 replica / L2 home location, the L1-Classifier is looked up to get the remote reuse corresponding to the sharer. If the (private + remote) reuse is ≥ PCT, the line is still allowed to be replicated in the L1 cache. Else, the line is not allowed to be replicated in the L1 cache and the sharer is "demoted" to a remote sharer. On an eviction response from the L1 cache to the L2 replica / L2 home location, the private reuse is compared to PCT. If the private reuse is ≥ PCT, the line continues to be replicated in the L1 cache. Else, the line is "demoted" to the status of a remote sharer, and the remote access threshold (RAT) is increased to the next level.

From L2 Replica: On an invalidation response from the L2 replica to the L2 home, the L2-Classifier is looked up to get the remote reuse. If the (private + remote) reuse is ≥ RT, the line continues to be replicated in the L2 cache. Else, the line is not allowed to be replicated and the core is "demoted" to a remote sharer. On an eviction response from the L2 replica to the L2 home location, the private reuse is compared to RT. If the private reuse is ≥ RT, the line continues to be replicated in the L2 cache. Else, the line is "demoted" to the status of a remote sharer, and the remote access threshold (RAT) in the L2-Classifier is increased to the next level.

7.2.3 Optimizations

Remote Access Threshold (RAT): To improve the performance of the L1-Classifier and L2-Classifier working together, the Remote Access Threshold (RAT) scheme is used (cf. Section 4.3). On an eviction where the next mode is remote, the classifier incrementally raises the threshold for transitions from remote to private mode. This makes it harder for the core to be promoted to private mode in case it wants to stay in remote mode, and prevents the core from ping-ponging between private & remote modes. With the combined classifier, this also gives either the L1-Classifier or the L2-Classifier a fair chance of obtaining the cache line in private mode in case the other one continually classifies the line into remote mode.

Limited Locality Classifier: The Limited3 classifier is used to reduce the overhead needed to track the mode, remote reuse, and RAT-level counters. These counters are only tracked for 3 cores and the modes of the other cores are obtained using a majority vote. The management of this limited classifier is done in the same way as described previously in Chapters 4 & 6.

7.2.4 Overheads

Each L2 cache tag contains a private reuse counter as well as the L1 and L2 classifiers. Assuming a PCT of 4, an RT of 3, the Limited3 classifier, 2 RAT levels and an RATmax of 16, the storage overhead for each cache tag is 2 + 3 × (6 + 2 × (1 + 4 + 1)) = 56 bits. The storage overhead per core for the L2 cache tags is 56 × 2^12 bits = 28 KB. Each L1 cache tag contains a private reuse counter as well. The storage overhead per core for the L1 cache tags is 2 × 3 × 2^8 bits ≈ 0.19 KB. The timestamp scheme, which serves to implement load speculation while conforming to a particular memory consistency model, incurs a 2.5 KB overhead. Overall, the storage overhead per core is 30.7 KB.
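As a sanity check on the arithmetic above, the snippet below recomputes the per-core storage overhead. The field-width breakdown in the comments (a 6-bit core ID and, per classifier, a 1-bit mode, 4-bit reuse counter and 1-bit RAT level) and the 48 KB aggregate L1 capacity are interpretations consistent with the quoted totals, not values taken from the thesis tables, and should be read as assumptions.

```cpp
#include <cstdio>

int main() {
    // Assumed field widths behind the 56-bit per-tag figure in Section 7.2.4.
    const int privateReuseBits = 2;            // per-line private reuse counter
    const int coreIdBits       = 6;            // 64 cores -> 6 bits per tracked entry
    const int perClassifier    = 1 + 4 + 1;    // mode + reuse counter + RAT level
    const int trackedCores     = 3;            // Limited3 classifier

    const int bitsPerL2Tag = privateReuseBits
                           + trackedCores * (coreIdBits + 2 * perClassifier); // = 56

    const int l2LinesPerCore = 1 << 12;        // 256 KB L2 slice / 64 B lines = 4096
    const int l1LinesPerCore = 3 << 8;         // assumed 48 KB of L1 / 64 B lines = 768

    const double l2TagKB = bitsPerL2Tag * l2LinesPerCore / 8.0 / 1024.0;      // 28 KB
    const double l1TagKB = privateReuseBits * l1LinesPerCore / 8.0 / 1024.0;  // ~0.19 KB
    const double timestampKB = 2.5;            // timestamp scheme (Section 7.2.4)

    std::printf("per-core storage overhead: %.2f KB\n",
                l2TagKB + l1TagKB + timestampKB);                             // ~30.7 KB
    return 0;
}
```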
7.3 Evaluation Methodology

We evaluate a 64-core shared memory multicore using out-of-order cores. The default architectural parameters used for evaluation are shown in Table 2.1. The parameters specific to the timestamp-based speculation violation detection and the locality-aware protocols are shown in Table 7.1.

Architectural Parameter                                   Value

Locality-Aware Private Cache Replication
  Private Caching Threshold                               PCT = 4
  Max Remote Access Threshold                             RATmax = 16
  Number of RAT Levels                                    nRATlevels = 2
  Classifier                                              Limited3

Locality-Aware LLC Replication
  Replication Threshold                                   RT = 3
  Max Remote Access Threshold                             RATmax = 16
  Number of RAT Levels                                    nRATlevels = 2
  Classifier                                              Limited3

Out-of-Order Core
  Speculation Violation Detection                         Timestamp-based
  Timestamp Width (TW)                                    16 bits
  History Retention Period (HRP)                          512 ns
  L1 Load/Store History Queue (L1LHQ/L1SHQ) Size          0.8 KB, 0.4 KB
  L2 Load/Store History Queue (L2LHQ/L2SHQ) Size          0.4 KB, 0.2 KB
  Pending Store History Queue (PSHQ) Size                 0.2 KB

Table 7.1: Locality-aware protocol & timestamp-based speculation violation detection parameters.

7.4 Results

In this section, the following four schemes are compared.

1. RNUCA: Reactive-NUCA is the baseline scheme that implements the data placement and migration techniques of R-NUCA (basically, the locality-aware protocol with a PCT of 1).
2. L1: Locality-aware L1 (/private cache) replication with a PCT of 4.
3. L2: Locality-aware L2 (/LLC) replication with an RT of 3.
4. L1+L2: Locality-aware L1+L2 (cache hierarchy) replication with a PCT of 4 and an RT of 3.

Figures 7-3 and 7-4 plot the completion time and energy obtained when using the above 4 design alternatives.

Completion Time: We observe that the L1+L2 scheme in general tracks the best of locality-aware L1 or L2 replication. For example, L1+L2 tracks the performance of L1 in benchmarks such as CONCOMP and TSP, and the performance of L2 in benchmarks such as PATRICIA and LU-NC. In the FACESIM benchmark, where both L1 & L2 provide benefits, the L1+L2 scheme improves upon the benefits provided by the two protocols. This is because L1+L2 possesses the functionality of both the L1 and L2 schemes and can adaptively decide the best mode for a line. Only in two benchmarks, OCEAN-NC & STREAMCLUSTER, does L1+L2 perform worse than the best of L1 and L2. This is due to load speculation violations created by the timestamp scheme. These load speculation violations arise in the critical section of the application, and hence they increase the synchronization time as well. Note that in RNUCA and the L2 scheme, cache lines are always replicated in the L1 cache, hence invalidation/update requests can be relied on to detect load speculation violations. We do not model violations due to invalidation requests in our simulator, so the performance provided by RNUCA and the L2 scheme is an upper bound.

Figure 7-3: Completion Time breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here.

Overall, the L1, L2 and L1+L2 schemes improve completion time by 13%, 10% and 15% respectively compared to the RNUCA baseline.
Energy Consumption: The L1+L2 scheme tracks the energy of the L1 scheme in benchmarks such as RADIX, DIJKSTRA-AP & DFS and the energy of the L2 scheme in benchmarks such as BARNES, BLACKSCHOLES and SWAPTIONS.

Figure 7-4: Energy breakdown for the schemes evaluated. Results are normalized to that of R-NUCA. Note that Average and not Geometric-Mean is plotted here.

In benchmarks such as RAYTRACE, FACESIM, STREAMCLUSTER, BODYTRACK and PATRICIA, the L1+L2 scheme has lower energy consumption than both the L1 and L2 schemes due to its capability to select the best of the two at cache line granularity based on the reuse of the line. In addition, the L1+L2 scheme also introduces a 4th mode which allows cache lines to be replicated in the L2 but not in the L1. This allows more efficient access to cache lines that have a higher reuse distance.

In benchmarks such as VOLREND and CHOLESKY, whose energy is dominated by the L1-D cache (since there are only a few L1 cache misses), both the L1 and the L1+L2 schemes have to incur the energy overhead of accessing the L1 and L2 history queues on every cache access, and this increases their overall energy consumption. In such benchmarks, the L1+L2 scheme is only able to perform as well as the L1 scheme. In other benchmarks such as WATER-NSQ and LU-NC, the L1+L2 scheme gets all the network energy benefits of the L2 scheme, but incurs the overhead of access to the history queues.

In the CONCOMP benchmark, most cache lines prefer not to be replicated in the L1 cache and are accessed remotely at the L2 home location due to almost no reuse at the L1 cache (an L1 cache miss rate of 42%). The L1+L2 scheme pays an occasional overhead of placing these cache lines at the L2 replica location in order to learn whether they are reused at the L2. This overhead increases the network & DRAM energy by a small amount due to additional evictions from the L2 cache.

In the FLUIDANIMATE benchmark, the modes of cache lines change extremely frequently over time, thereby causing the locality-aware coherence protocols (which are based on the immediate past history) to incur false classifications. This increases the network and L2 cache energy consumption over the RNUCA baseline.

Overall, the L1, L2 and L1+L2 schemes improve energy by 15%, 15% and 22% respectively compared to the RNUCA baseline.

7.4.1 PCT and RT Threshold Sweep

In this section, the PCT and RT parameters of the L1+L2 scheme are varied from 1 to 8 and the resulting completion time and energy are plotted in Figures 7-5 and 7-6. We observe that the completion time & energy are high at low values (i.e., 1, 2) of PCT & RT. This is due to the network traffic overheads, low cache utilization, and the resultant processor stalls incurred from replicating low-reuse cache lines in the L1 & L2 caches.
Figure 7-5: Variation of Completion Time as a function of PCT & RT. The Geometric-Mean of the completion time obtained from all benchmarks is plotted.

Figure 7-6: Variation of Energy as a function of PCT & RT. The Geometric-Mean of the energy obtained from all benchmarks is plotted.

As PCT & RT increase to mid-range values (i.e., to 3, 4, 5), both the completion time & energy consumption reduce drastically. After that, completion time & energy increase gradually. A PCT of 4 and an RT of 3 are selected because they provide the best Energy x Delay product among the possible <PCT, RT> combinations. If the best <PCT, RT> combination is selected for each benchmark separately, the completion time and energy are only improved by 2% compared to using a PCT of 4 and an RT of 3 for all benchmarks. This justifies a static selection of PCT & RT.

7.5 Summary

This chapter combines the private cache (i.e., L1) and LLC (i.e., L2) replication schemes discussed in Chapters 4 & 6 into a combined cache hierarchy replication scheme. The combined scheme exploits the advantages of both the private cache & LLC replication protocols synergistically. Overall, evaluations on a 64-core multicore processor show that locality-aware cache hierarchy replication improves completion time by 15% and energy by 22% compared to the Reactive-NUCA baseline and can be implemented with a 30.7 KB storage overhead per core.

Chapter 8
Related Work

Previous research on cache hierarchy organizations & the implementation of memory consistency models in multicore processors can be discussed based on the following eight criteria.

1. Data replication
2. Coherence directory organization
3. Selective caching / dead block eviction
4. Remote access
5. Data placement and migration
6. Cache replacement policy
7. Cache partitioning / cooperative cache management
8. Memory consistency models

8.1 Data Replication

Previous research on data replication in multicore processors has mainly focused on the last-level cache. All other cache levels have traditionally been organized as private to a core, and hence data can be replicated in them on demand without any additional control strategy. Last-level caches (LLCs) have been organized as private [26], shared [2] or a combination of both [97, 24, 10, 39]. The benefits of having a private or shared LLC organization depend on the degree of sharing in an application as well as its data access patterns. While private LLC organizations have low hit latencies, their off-chip miss rates are high in applications that exhibit high degrees of sharing, due to cache line replication. Shared LLC organizations, on the other hand, have high hit latencies since each request has to complete a round-trip over the interconnection network. This hit latency increases as more cores are added since the diameter of practically feasible on-chip networks increases with the number of cores. However, their off-chip miss rates are low since cache lines are not replicated. Both private and shared LLC organizations incur significant protocol latencies when a writer of a cache block invalidates multiple readers, the impact being directly proportional to the degree of sharing of the cache block. (Note that processors with shared LLC organizations typically have private lower-level caches.)
Four recently proposed hybrid LLC organizations that combine the good characteristics of private and shared LLC organizations are CMP-NuRAPID [24], Victim Replication [97], Adaptive Selective Replication [10], and Reactive-NUCA [39].

CMP-NuRAPID (Non-Uniform access with Replacement And Placement usIng Distance associativity) [24] uses Controlled Replication to place data so as to optimize the distance to the cache bank holding the data. The idea is to decouple the tag and data arrays and maintain private per-core tag arrays and a shared data array. The shared data array is divided into multiple banks based on distance from each core. A cache line is replicated in the cache bank closest to the requesting core on its second access, the second access being detected using the entry in the tag array. This scheme does not scale with the number of cores since each private per-core tag array potentially has to store pointers to the entire data array. Results indicate that the private tag array size used in CMP-NuRAPID should only be twice the size of the per-cache-bank tag array, but this is because only a 4-core CMP is evaluated. In addition, CMP-NuRAPID requires snooping coherence to invalidate replicas as well as additional cache controller transient states for ensuring the correct ordering of invalidations and read accesses.

Victim Replication (VR) starts out with a private-L1 shared-L2 organization and uses the local L2 slice as a victim cache for data that is evicted from the L1 cache. The eviction victims are placed in the L2 slice only if a line is found that is either invalid, a replica itself, or has no sharers in the L1 cache. By only replicating the L1 capacity victims, this scheme attempts to combine the low hit latency of private LLCs with the low off-chip miss rates of shared LLCs. However, this strategy blindly replicates all L1 capacity victims without paying attention to cache pressure.

Adaptive Selective Replication (ASR) operates similarly to Victim Replication by replicating cache lines in the local L2 slice on an L1 eviction. However, it pays attention to cache pressure by basing its replication decision on a probability. The probability value is picked from discrete replication levels on a per-cache basis. A higher replication level indicates that L1 eviction victims are replicated with a higher probability. The replication levels are decided dynamically based on the cost and benefit of replication. When operating at a particular replication level, ASR estimates the cost and benefit of increasing or decreasing the level using 4 hardware monitoring circuits.

Both VR and ASR have the following three drawbacks.

1. L2 replicas are allocated without paying attention to whether they will be referenced in the near future. Applications with a huge working set or those with a high proportion of compulsory misses (streaming workloads) are adversely affected by this strategy. Applications with cache lines that are immediately invalidated without additional references from the L1 do not benefit either.
2. If the data is not found in the local L2 slice, the read request has to be sent to the home L2 cache. The capacity of the L2 cache is not shared amongst neighboring cores. Applications that widely share instructions and data (through read accesses) are not well served by pinning the replication location to the local L2 slice. Sharing a replication location amongst a cluster of cores would have been the optimal strategy.
3. The L2 cache slices always have to be searched and invalidated along with the L1 cache. Although the performance overhead is small, it increases the complexity of the protocol.

Reactive-NUCA replicates instructions in one LLC slice per cluster of 4 cores using rotational interleaving. Data is never replicated and is always placed at a single LLC slice. The one-size-fits-all approach to handling instructions does not work for applications with heterogeneous instructions, nor does it work for applications where the optimal cluster size for replication is not 4. In addition, the experiments in Chapter 6 show significant opportunities for improvement through replication of shared read-only and shared read-write data that is frequently read and sparsely written.

The locality-aware data replication scheme discussed in this thesis does not suffer from the limitations of the above-mentioned schemes. It only replicates cache lines that show reuse at the LLC, bypasses replication mechanisms for cache lines that do not exhibit reuse, and adapts the cluster size according to application needs to optimize performance and energy consumption. The cache lines that are replicated are purely those that exhibit reuse and include instructions, shared read-only and shared read-write data. No coarse-grain classification decisions guide the replication process.

In addition to the above-mentioned drawbacks, all the schemes discussed leave the private caches unmanaged. A request for data allocates a cache line in the private cache hierarchy even if the data has no spatial or temporal locality. This leads to cache pollution since such low-locality cache lines can displace more frequently used data. The locality-aware coherence protocol discussed in this thesis focuses on intelligent management of private caches. Managing private caches is important because (1) they are generally capacity-stressed due to strict size and latency limitations, and (2) they replicate shared data without paying any attention to its locality.

8.2 Coherence Directory Organization

Several proposals have been made for scalable directory organizations. Techniques include reducing the size of each directory entry as well as increasing the scalability of the structure that stores directory entries. Hierarchical directory organizations [64] enable area-efficient uncompressed vector storage through multiple serialized lookups. However, hierarchical organizations impose additional lookups on the critical path, hurting latency and increasing complexity. Limited directories [6] have been proposed that invalidate cache lines so as to maintain a constant number of sharers. Such schemes hurt cache lines that are widely read-shared. Limited directories with software support [21] remove this restriction but require OS support. Chained directories [20] maintain a linked list of sharers (one L1 cache tag pointing to another) but are complex to implement and verify due to distributed linked-list operations. Coarse vectors [37] maintain sharing information per cluster of cores and hence cause higher network traffic and complexity.

The Duplicate-Tag directory [9, 88] reduces storage space but requires an energy-intensive associative search to retrieve sharing information. This associativity increases as more cores are added. The Tagless directory [96] removes the energy-inefficient associative lookup of the Duplicate-Tag organization using Bloom filters to represent the sharers.
However, it adds a lot of extra complexity since false positives (sharers marked present at the directory but not actually present) need to be handled correctly. Moreover, Tagless requires extra lookup and computation circuitry during eviction and invalidation to reset the relevant Bloom filter bits.

Sparse directory schemes [37, 70] organize the directory like a set-associative cache with low associativity and are more power-efficient than Duplicate-Tag directories. But they incur directory-induced back-invalidations when some sets are more heavily accessed than others. Hence, considerable area cost is expended in over-provisioning the directory capacity to avoid set conflicts. The Cuckoo directory [32] avoids these set conflicts using an N-ary Cuckoo hash table with different hash functions for each way. Unlike a regular set-associative organization that always picks a replacement victim from a small set of conflicting entries, the Cuckoo directory displaces victims to alternate non-conflicting ways, resorting to eviction only in exceptional circumstances. The Scalable Coherence Directory (SCD) [80] introduces a variable sharer-set representation to store the sharers of a cache line. While a cache line with a few sharers uses a single directory tag, widely shared cache lines use multiple tags. SCD operates like a limited directory protocol when the sharers can be tracked with the single directory tag. As the number of sharers grows, it switches to hierarchical sharer tracking, with the root tag tracking sharing at a cluster level and the leaf tags tracking the sharing within each cluster. SCD uses Cuckoo directories / ZCache [78] for high associativity.

The in-cache directory [18, 13] avoids the overhead of adding a separate directory structure as in Sparse directories but is area-inefficient because the lower-level caches (L1) are much smaller than the higher-level ones (L2). However, in-cache directories do not suffer from back-invalidations. For a CMP, in-cache directories are only practical with at least one shared cache in the cache hierarchy. SPACE [99] (Sharing pattern-based directory coherence) exploits sharing-pattern commonality to reduce directory storage. If multiple cache lines are shared by the same set of cores, SPACE allocates just one sharing pattern for them. However, these sharing patterns must be replicated if the directory is distributed among multiple cores (which needs to be done in modern CMPs to avoid excessive contention). SPATL [100] (Sharing-PAttern based TagLess directory) decouples the sharing patterns from the Bloom filters of Tagless and eliminates the redundant copies of sharing patterns. Although this results in improved directory storage over Tagless and SPACE, the additional complexity of Tagless remains. In-Network Cache Coherence [29] embeds directory information in network routers so as to remove directory indirection delays and fetch cache lines from the closest core. The directory information within the network routers is organized as a tree and hence requires a lot of extra complexity and latency during writes to clean up the tree. Additionally, each router access becomes more expensive.

The above-described schemes are either inefficient, or increase the directory organization complexity significantly by using a compressed representation of the sharers, or require complex on-chip network capabilities.
The ACKwise [56, 57] directory coherence protocol, on the other hand, uses a simple sharer representation with a limited directory and relies on simple changes to an electrical mesh network for broadcasts. Since the number of sharers is always tracked, acknowledgements are efficient. In addition, the locality-aware cache coherence protocol reduces the number of invalidations to cache lines with low spatio-temporal locality, making ACKwise more efficient. The locality-aware data replication scheme tries to incorporate the benefits of hierarchical directories (i.e., hierarchical sharer tracking and data locality) by replicating those cache lines with high utility at the L2 cache. In addition, it avoids the pitfalls of hierarchical directories by not performing serial lookups or hierarchical invalidations for cache lines that do not benefit from replication.

8.3 Selective Caching / Dead-Block Eviction

Several proposals have been made for selective caching in the context of uniprocessors. Selective caching, a.k.a. cache bypassing, avoids placing the fetched cache line in the cache and either discards it or places it in a smaller temporary buffer so as to improve cache utilization. Most previous works have explored selective caching in the context of prefetching. Another related body of work is dead-block eviction. Dead blocks are cache lines that will not be reused in the future and hence may be removed so as to improve prefetching or reduce cache leakage energy.

McFarling [67] proposed dynamic exclusion to reduce conflict misses in a direct-mapped instruction cache. Stream Buffers [48] place the prefetched data into a set-associative buffer to avoid cache pollution. Abraham et al [5] show that fewer than 10 instructions account for half the cache misses for six out of nine SPEC89 benchmarks. Tyson et al [89] use this information to avoid caching the data accessed by such instructions so as to improve cache utilization. Gonzalez et al [35] propose a dual data cache with independent parts for managing data with spatial and temporal locality. They also implement a lazy caching policy which tries not to cache anything until a benefit (in terms of spatial or temporal locality) can be predicted. The locality of a data reference is predicted using a program counter (PC) based prediction table.

Lai et al [58] propose dead-block predictors (DBPs) that use a trace of memory references to predict when a block in a data cache becomes evictable. They also propose dead-block correlating prefetchers (DBCPs) that use address correlation in conjunction with dead-block traces to predict a subsequent address to prefetch. Lui et al [61] increase dead-block prediction accuracy by predicting dead blocks based on bursts of accesses to a cache block. These cache bursts are more predictable since they hide the irregularity of individual references. The best performance-optimizing strategy is then to replace these dead blocks with prefetched blocks.

While the locality-aware coherence protocol is orthogonal to the dead-block eviction techniques, it differs from the above selective caching techniques in the following ways:

1. In prior selective caching schemes, the referenced data is always brought into the core but is then placed in a set-associative buffer or discarded.
On the contrary, the locality-aware protocol selectively decides whether to move a cache line from the shared LLC to the private cache or to simply access the requested word at the shared LLC, thereby reducing the energy consumption of moving unnecessary data through the on-chip network.

2. All prior selective caching proposals have focused on uniprocessors, while this thesis targets large-scale shared memory multicores running multithreaded applications with shared data. In addition to the traditional private data, the locality-aware protocol also tracks the locality of shared data and potentially converts expensive invalidations into much cheaper word accesses, which not only improves memory latency but also reduces the energy consumption of the network and cache resources.

3. Prior schemes use a program counter (PC) based prediction strategy for selective caching. This may lead to inaccurate classifications and is insufficient for programs that want to cache a subset of their working set. On the contrary, the locality-aware protocol works at the fine granularity of cache lines and thereby avoids the above shortcomings.

8.4 Remote Access

Remote access has been used as the sole mechanism to support shared memory in multicores [92, 41, 31]. Remote access in coordination with intelligent cache placement [31] has been proposed as an alternative to cache coherence. Remote store programming (RSP) has been used to increase the performance of certain HPC applications [41] by placing data at the consumers. This is beneficial because processor performance is more sensitive to load latency than store latency (since stores can be hidden using the store queue). Software-controlled asynchronous remote stores [63] have been proposed to reduce the communication overhead of synchronization variables such as locks, barriers and presence flags. Recently, research proposals that utilize remote access as an auxiliary mechanism [71, :5] have demonstrated improvements in performance and energy consumption. In this thesis, we observe that complex cores that support popular memory models (e.g., x86 TSO, ARM, SC) need novel mechanisms to benefit from these adaptive protocols.

8.5 Data Placement and Migration

Previous works have tackled data placement and migration either using compiler techniques or a combination of micro-architectural and operating system techniques. Yemliha et al [95, 98] discuss compiler-directed code and data placement algorithms that build a CDAG (Code Data Affinity Graph) offline and use the edge weights of the graph to perform placement. These techniques are limited to benchmarks whose access patterns can be determined statically and cannot accommodate dynamic behavior.

Kim et al [;)0] first proposed non-uniform caches (NUCA) for uniprocessors. They propose two variants: Static-NUCA (SNUCA) and Dynamic-NUCA (DNUCA). While SNUCA statically interleaves cache blocks across banks, DNUCA migrates high-locality blocks to the bank closest to the cache controller. DNUCA uses parallel multicast or sequential search to access data, starting from the closest bank to the farthest bank. Chishti et al [23] propose NuRAPID (Non-Uniform access with Replacement And Placement usIng Distance associativity) to reduce the energy consumption in NUCA. It decouples the tag array from the data array, using a central tag array to place the frequently accessed data in the fastest subarrays, with fewer swaps than DNUCA, while placing rarely accessed data in the farthest subarrays.
Chishti et al. [23] propose NuRAPID (Non-Uniform access with Replacement And Placement usIng Distance associativity) to reduce the energy consumption in NUCA. It decouples the tag array from the data array, using a central tag array to place the frequently accessed data in the fastest subarrays, with fewer swaps than DNUCA, while placing rarely accessed data in the farthest subarrays. It exploits the fact that tag and data accesses are performed sequentially in a large last-level cache.

Beckmann et al. [11] propose CMP-SNUCA and CMP-DNUCA, which extend these cache architectures to CMPs. They observe that most of the L2 hits are made to shared blocks, and hence block migration in CMP-DNUCA does not prove to be very useful since it ends up placing the shared block at the center of gravity of the requesters (not close to any requester). In addition, block migration necessitates a 2-phase multicast search algorithm to locate the L2 cache bank that holds a particular cache block. This search algorithm degrades the performance of CMP-DNUCA even below that of CMP-SNUCA in many applications. The authors also evaluate the benefits of strided prefetching with CMP-SNUCA (static interleaving) and find them to be greater than those of block migration with CMP-DNUCA. CMP-NuRAPID [24] (discussed earlier) uses Controlled Replication to force just one cache line copy for read-write shared data and forces write-through to maintain coherence for such lines using a special Communication (C) coherence state. The drawback again is that the size of the per-core private tag arrays does not scale with the number of cores.

Cho et al. [25] propose using page-table and TLB entries to perform data placement. However, they leave intelligent page allocation policies to future work. Reactive-NUCA [39] uses the above technique to place private data local to the requesting core and to statically interleave shared data in different slices of the shared L2 cache. The only overhead is that when the classification of a page turns from private to shared, the existing cache copies of the page at its first requester have to be purged. Dynamic directories [28] adapt the placement and migration policies of Reactive-NUCA for directory entries with a few modifications. Directory entries are not placed for private data. For shared data, directory entries are statically interleaved using a hash function of the address. In this thesis, the placement and migration algorithms of Reactive-NUCA are reused. However, shared data and instructions may be replicated in an L2 cache slice close to the requester, as described earlier.

8.6 Cache Replacement Policy

Qureshi et al. [75] divide the problem of cache replacement into two parts: the victim selection policy and the insertion policy. The victim selection policy decides which line gets evicted to make room for an incoming line, whereas the insertion policy decides where in the replacement list the incoming line is placed. Their proposed replacement policy is motivated by the observation that when the working set is greater than the cache size, the traditional LRU replacement policy causes all installed lines to have poor temporal locality. The optimal replacement policy should retain some fraction of the working set long enough so that at least that fraction of the set provides cache hits. They propose the Bimodal Insertion Policy, which inserts the majority of the incoming cache lines in the LRU position and a few in the MRU position. This reduces cache thrashing in workloads where the working set is greater than the cache size. The incoming lines that are placed in the LRU position get promoted to the MRU position only if they get re-referenced.
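The following minimal C++ model of a single cache set illustrates the Bimodal Insertion Policy just described: most misses insert at the LRU end, a small fraction insert at the MRU end, and a line is promoted to MRU only on a hit. The way count, insertion probability and access pattern are illustrative assumptions, not parameters taken from [75].

    #include <cstdint>
    #include <cstdlib>
    #include <deque>
    #include <iostream>

    // Minimal model of one cache set managed with a recency list.
    // Bimodal insertion: most incoming lines enter at the LRU end; with a small
    // probability epsilon they enter at the MRU end. A line moves to MRU only on a hit.
    struct Set {
        std::deque<uint64_t> recency;          // front = MRU, back = LRU
        unsigned ways;
        explicit Set(unsigned w) : ways(w) {}

        bool access(uint64_t tag, double epsilon) {
            for (auto it = recency.begin(); it != recency.end(); ++it) {
                if (*it == tag) {              // hit: promote to MRU
                    recency.erase(it);
                    recency.push_front(tag);
                    return true;
                }
            }
            if (recency.size() == ways) recency.pop_back();   // evict the LRU line
            double r = static_cast<double>(std::rand()) / RAND_MAX;
            if (r < epsilon) recency.push_front(tag);          // rare MRU insertion
            else             recency.push_back(tag);           // common LRU insertion
            return false;
        }
    };

    int main() {
        Set set(8);
        unsigned hits = 0;
        for (int pass = 0; pass < 10; ++pass)
            for (uint64_t tag = 0; tag < 16; ++tag)            // working set larger than the set
                hits += set.access(tag, 1.0 / 32);
        std::cout << "hits: " << hits << "\n";   // some fraction of the working set is retained
    }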
They also propose the Dynamic Insertion Policy, which switches between the Bimodal Insertion Policy (for LRU-unfriendly workloads) and the traditional LRU policy (for LRU-friendly workloads) at the application level based on dynamic behavior, using Set Dueling Monitors (SDMs). The victim selection policy remains the same as in the traditional LRU scheme.

Jaleel et al. [45] modify the above policy for multiprogrammed workloads and propose the Thread-Aware Dynamic Insertion Policy (TADIP), which is cognizant of multiple workloads running concurrently and accessing the same shared cache. They propose using the Bimodal Insertion Policy for streaming as well as cache-thrashing applications, while using the traditional LRU insertion policy for recency-friendly applications.

Jaleel et al. [46] further propose cache replacement using re-reference interval prediction (RRIP) to make a cache scan-resistant in addition to preserving its thrash-resistant property using TADIP. Scans are access patterns where a burst of references to non-temporal data discards the active working set from the cache. RRIP operates using an M-bit register per cache block to store its re-reference interval prediction value (RRPV) in order to adapt its replacement policy for data cached by scans.

This thesis applies different techniques to remotely cache, at the shared LLC location, low-locality data that participates in thrashing and scans. Controlled replication is also used to achieve the optimal data locality and miss rate for the L2 cache.

8.7 Cache Partitioning / Cooperative Cache Management

With the introduction of multicore processors, it is now possible to execute multiple applications concurrently on the same processor. This makes last-level cache (LLC) management complicated because these applications may have varying cache demands. While shared caches enable effective capacity sharing, i.e., applications can use cache capacity based on demand, private caches constrain the capacity available to an application to the size of the cache on the core it is executing on. This is bad for concurrently executing applications that have varying working set sizes. On the other hand, private caches inherently provide performance isolation, so a badly behaving application cannot hurt the performance of other concurrently executing applications. Several recent proposals combine the benefits of shared and private caches. While cache partitioning schemes [76, 79] divide up the shared cache among multiple applications so as to optimize throughput and fairness, cooperative cache management schemes [22, 74, 40, 77] start with a private cache organization and transfer evicted lines to other private caches (spilling) to exploit cache capacity.

Utility-based cache partitioning [76] uses way-partitioning and allocates a varying number of ways of the cache to different applications. The number of ways allocated is decided by measuring the miss rate of an application as a function of the number of ways allocated to it. This measurement is done using hardware monitoring circuits based on dynamic set sampling. However, way partitioning decreases the cache associativity available to an application and is only useful when the number of ways of the cache is greater than the number of partitions. Vantage [79] removes the inefficiencies of way-partitioning by using a highly-associative cache (ZCache [78]) and relying on statistical analysis to provide strong guarantees and bounds on associativity and sizing. It also improves performance by partitioning only 90% of the cache while retaining 10% unpartitioned for dynamic application needs.
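As an illustration of the utility-driven allocation idea behind way-partitioning schemes such as [76], the following C++ sketch greedily hands out ways to whichever application gains the most from its next way. The miss curves are assumed inputs standing in for the hardware monitors described above, and the simple greedy loop is a simplification rather than the exact allocation algorithm of [76].

    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Greedy way allocation in the spirit of utility-based cache partitioning:
    // repeatedly give the next way to the application whose miss count drops the most.
    // misses[a][w] = misses of application a when given w ways (w = 0 .. total_ways).
    std::vector<unsigned> allocate_ways(const std::vector<std::vector<double>>& misses,
                                        unsigned total_ways) {
        std::vector<unsigned> alloc(misses.size(), 0);
        for (unsigned w = 0; w < total_ways; ++w) {
            std::size_t best = 0;
            double best_gain = -1.0;
            for (std::size_t a = 0; a < misses.size(); ++a) {
                double gain = misses[a][alloc[a]] - misses[a][alloc[a] + 1];
                if (gain > best_gain) { best_gain = gain; best = a; }
            }
            ++alloc[best];                 // give one more way to the best candidate
        }
        return alloc;
    }

    int main() {
        // Assumed miss curves for two applications sharing an 8-way cache.
        std::vector<std::vector<double>> misses = {
            {100, 60, 40, 30, 25, 22, 20, 19, 18},   // benefits a lot from its first few ways
            {100, 95, 90, 85, 80, 75, 70, 65, 60}    // streaming-like, small marginal gains
        };
        for (unsigned a : allocate_ways(misses, 8)) std::cout << a << " ";
        std::cout << "\n";
    }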
Cooperative caching [22] starts with a private LLC baseline to minimize hit latency, but spills evicted lines into the caches of adjacent cores to enable better capacity sharing. However, this scheme does not take into account the utility of the spilled cache lines and can perform very badly with streaming applications as well as applications with huge working set sizes. Adaptive spill-receive [74] marks each private cache as a spiller or a receiver (which receives spilled cache lines). The configuration of a cache is decided based on set dueling. Set dueling dedicates a few sets of each cache to always spill or always receive and measures the cache miss rate incurred by these sets. The rest of the sets are marked as spill or receive based on which does better. This scheme does not scale beyond a few cores because of the inherent bottlenecks in set dueling. Adaptive set-granular cooperative caching [77] marks each set of a cache as spiller, receiver or neutral, thus enabling cooperative caching at a very fine-grained level. A set marked as neutral does not spill lines or receive spilled lines. The cache to spill to is decided once per set using a broadcast, and its location is recorded locally for future spills. However, this scheme cannot allocate more than twice the private cache size to an application.

The locality-aware adaptive cache coherence and data replication techniques are orthogonal to the above proposals. However, for data replication, the monitoring circuit used by utility-based cache partitioning is adapted in one of the methods to measure the miss rate incurred by alternate replication cluster sizes.

8.8 Memory Consistency Models

Speculatively relaxing memory order was proposed with load speculation in the MIPS R10K [94]. Full speculation for SC execution has been studied as well [34, 84]. InvisiFence [15], Store-Wait-Free [91] and BulkSC [19] are proposals that attempt to accelerate existing memory models by reducing both buffer capacity and ordering-related stalls. Timestamps have been used to implement memory models [30, 85] and for cache coherence verification [73]. This thesis, however, is the first to identify and solve the problems associated with using remote accesses that make data access efficient in large-scale multicores.

8.9 On-Chip Network and DRAM Performance

Memory bottlenecks in multicores could be alleviated using newer technologies such as embedded DRAM [69] and 3D stacking [62], as well as by using intelligent memory scheduling techniques [8]. Network bottlenecks could be reduced by using newer technologies such as photonics [56], as well as by using better topologies [82], router bypassing [51], adaptive routing schemes [27] and intelligent flow control [27] techniques. Both memory and network bottlenecks could also be alleviated using intelligent cache hierarchy management. Better last-level cache (LLC) partitioning [76, 12] and replacement schemes [75, 46] have been proposed to reduce memory pressure. Better cache replication [39, 97, 10], placement [39, 12] and allocation [04] schemes have been proposed to exploit application data locality and reduce network traffic. Our proposed extension for locality-aware cache coherence ensures that it can be implemented in multicore processors with popular memory models alongside any of the above schemes.
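As a rough illustration of the short-window timestamp idea referred to in Section 8.8, the following C++ sketch flags a speculatively performed load for squashing only when a conflicting store falls within a small, assumed history window. The structures, names and check are hypothetical simplifications, not the mechanism proposed in this thesis.

    #include <cstdint>
    #include <iostream>

    // A speculative load needs to be squashed only if a conflicting store to the
    // same line is observed with a timestamp inside the small window during which
    // load timestamps are retained; older conflicts cannot violate ordering.
    constexpr uint64_t kHistoryWindow = 64;   // assumed window, in cycles

    struct SpeculativeLoad {
        uint64_t line_addr;
        uint64_t timestamp;                   // cycle at which the load was performed
    };

    bool must_squash(const SpeculativeLoad& load,
                     uint64_t store_line_addr,
                     uint64_t store_timestamp) {
        if (load.line_addr != store_line_addr) return false;          // no conflict
        if (store_timestamp < load.timestamp) return false;           // store ordered before the load
        return (store_timestamp - load.timestamp) <= kHistoryWindow;  // conflict with temporal proximity
    }

    int main() {
        SpeculativeLoad ld{0x1000, 500};
        std::cout << must_squash(ld, 0x1000, 530) << "\n";  // 1: conflicting store, close in time
        std::cout << must_squash(ld, 0x2000, 530) << "\n";  // 0: different cache line
    }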
Chapter 9

Conclusion

Applications exhibit varying reuse behavior towards cache lines, and this behavior can be exploited to improve performance and energy efficiency through selective replication of cache lines. Replicating high-reuse lines improves access latency and energy, while suppressing the replication of low-reuse lines reduces data movement and improves cache utilization. No correlation between reuse and cache line type (e.g., private data, shared read-only and read-write data) has been observed experimentally for data accesses. (Instructions, however, have good reuse.) Hence, a replication policy based on data classification does not produce optimal results.

The varying reuse behavior towards cache lines has been observed at both the L1 and L2 cache levels. This enables a reuse-based replication scheme to be applied to all levels of a cache hierarchy. Exploiting variable reuse behavior at the L1 cache also enables the transfer of just the requested word to/from the remote L2 cache location. This is more energy-efficient than transferring the entire contents of a low-reuse cache line. Such a data access technique is called 'remote access'.

In processors with good single-thread performance, it is important to use speculation and prefetching to retain load/store performance when strong memory consistency models are required. State-of-the-art processors rely on cache coherence messages (invalidations/evictions) to detect speculation violations. However, such coherence messages are avoided with remote accesses, necessitating an alternate technique to detect violations. This thesis proposes a novel timestamp-based technique to detect speculation violations. This technique can be applied to loads that access data directly at the private L1-D cache as well, obviating the need for cache coherence messages to detect violations. The timestamp mechanism is efficient due to the observation that consistency violations only occur due to conflicting accesses that have temporal proximity (i.e., within a few cycles of each other), thus requiring timestamps to be stored only for a small time window. The timestamp-based validation technique was found to produce results close to those of an ideal scheme that does not suffer from speculation violations.

Providing scalable cache coherence is of paramount importance to computer architects since it enables preserving the familiar programming paradigm of shared memory in multicore processors with ever-increasing core counts. This thesis proposes the ACKwise protocol, which provides scalable cache coherence in coordination with a network with broadcast support. Simple changes to the routing protocol of a mesh network are proposed to support broadcasts. No virtual channels are added. ACKwise also supports high degrees of read sharing without any overheads. ACKwise works in synergy with the locality-aware cache replication schemes. The locality-aware schemes prevent replication of low-reuse data, thereby reducing its number of private sharers. This potentially reduces the number of occasions when invalidation broadcasts are required. Employing broadcasts for high-reuse data does not harm efficiency since such data has a long lifetime in the cache.

The principal thesis contributions are summarized next, followed by a discussion of opportunities for future work.

9.1 Thesis Contributions

This thesis makes the following five important contributions that holistically address performance, energy efficiency and programmability.
1. Proposes a scalable limited directory-based coherence protocol, ACKwise [56, 57], that reduces the directory storage needed to track the sharers of a data block.

2. Proposes a Locality-aware Private Cache Replication scheme [54] to better manage the private caches in multicore processors by intelligently controlling data caching and replication.

3. Proposes a Timestamp-based Memory Ordering Validation technique that enables the preservation of familiar memory consistency models when the intelligent private cache replication scheme is applied to production processors.

4. Proposes a Locality-aware LLC Replication scheme [53] that better manages the last-level shared cache (LLC) in multicore processors by balancing shared data (and instruction) locality and off-chip miss rate through controlled replication.

5. Proposes a Locality-aware Cache Hierarchy Management scheme that seamlessly combines all the above schemes to provide an optimal combination of data locality and miss rate at all levels of the cache hierarchy. On a 64-core multicore processor with out-of-order cores, Locality-aware Cache Hierarchy Replication improves completion time by 15% and energy by 22% while incurring a storage overhead of 30.5 KB per core (i.e., ~10% of the aggregate cache capacity of each core).

9.2 Future Directions

There are three potential future directions for this work. The first combines hardware and software techniques to perform intelligent replication. The second reduces the storage overhead of the locality-aware replication schemes by potentially exploiting the correlation between the reuse exhibited by cache lines belonging to the same page or instruction address. The third reduces the dependence on the timestamp-based scheme by exploring an optimized variant of the non-conflicting scheme described earlier. This scheme works by classifying pages into private, shared read-only and shared read-write on a time-interval basis, allowing back-transitions from read-write to read-only and from shared to private. These potential future directions are discussed below.

9.2.1 Hybrid Software/Hardware Techniques

The programmer could designate certain data structures or program code as having potential benefits if replicated. The hardware would then be responsible for enforcing these software hints. In this case, the intelligence to decide which cache lines should be replicated is delegated to software. However, the hardware would still be required to implement the various schemes in a deadlock-free manner so that starvation freedom and forward progress can be ensured.

9.2.2 Classifier Compression

Compression techniques to reduce the storage overhead needed for the classifier need to be explored. One method is to exploit the correlation between the reuse of cache lines that belong to the same page or instruction. If such a correlation exists, then locality tracking structures need to be maintained only on a per-page or a per-instruction basis.

9.2.3 Optimized Variant of Non-Conflicting Scheme

One simple way to implement a memory consistency model is to enable loads/stores to non-conflicting data (i.e., private and shared read-only data) to be issued and completed out-of-order. If most data accesses are to private and shared read-only data, such an implementation would be efficient. This requires an efficient classifier that dynamically transitions pages from shared to private and from read-write to read-only when the opportunity arises. One way to achieve this is to classify pages on a time-interval basis and reset the assigned labels periodically.
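A minimal C++ sketch of this interval-based classification idea is given below: within an interval a page is promoted from private to shared read-only to shared read-write as accesses from other cores are observed, and the periodic reset permits the reverse transitions. The class names, transition rules and reset policy are illustrative assumptions rather than a worked-out design.

    #include <cstdint>
    #include <unordered_map>

    // Interval-based page classification: state is rebuilt within each interval,
    // so a page that stops being written or shared can fall back to a weaker class.
    enum class PageClass { Private, SharedReadOnly, SharedReadWrite };

    struct PageState {
        PageClass cls = PageClass::Private;
        int owner = -1;                        // first core to touch the page this interval
    };

    class IntervalClassifier {
        std::unordered_map<uint64_t, PageState> pages_;
    public:
        void record_access(uint64_t page, int core, bool is_write) {
            PageState& st = pages_[page];
            if (st.owner == -1) st.owner = core;
            if (core != st.owner)              // a second core touched the page
                st.cls = is_write ? PageClass::SharedReadWrite
                                  : (st.cls == PageClass::SharedReadWrite
                                         ? PageClass::SharedReadWrite
                                         : PageClass::SharedReadOnly);
            else if (is_write && st.cls == PageClass::SharedReadOnly)
                st.cls = PageClass::SharedReadWrite;
        }
        PageClass classify(uint64_t page) { return pages_[page].cls; }
        void end_interval() { pages_.clear(); }   // periodic reset enables back-transitions
    };

    int main() {
        IntervalClassifier c;
        c.record_access(0x10, /*core=*/0, /*is_write=*/true);   // still Private
        c.record_access(0x10, /*core=*/1, /*is_write=*/false);  // now SharedReadOnly
        c.end_interval();                                        // next interval: page may become Private again
    }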
Bibliography

[1] The SPARC Architecture Manual, Version 8. SPARC International, Inc. http://www.sparc.org/standards/V8.pdf, 1992.

[2] First the tick, now the tock: Next generation Intel microarchitecture (Nehalem). White Paper, 2008.

[3] DARPA UHPC Program (DARPA-BAA-10-37), March 2010.

[4] Intel unveils 72-core x86 Knights Landing CPU for exascale supercomputing. http://www.extremetech.com/extreme/171678-intel-unveils-72-core-x86-knights-landing-cpu-for-exascale-supercomputing, November 2013.

[5] Santosh G. Abraham, Rabin A. Sugumar, Daniel Windheiser, B. R. Rau, and Rajiv Gupta. Predictability of load/store instruction latencies. In Proceedings of the 26th Annual International Symposium on Microarchitecture, MICRO 26, pages 139-152, Los Alamitos, CA, USA, 1993. IEEE Computer Society Press.

[6] Anant Agarwal, Richard Simoni, John L. Hennessy, and Mark Horowitz. An Evaluation of Directory Schemes for Cache Coherence. In International Symposium on Computer Architecture, 1988.

[7] Jade Alglave, Daniel Kroening, Vincent Nimal, and Daniel Poetzl. Don't sit on the fence: A static analysis approach to automatic fence insertion. CoRR, abs/1312.1411, 2013.

[8] Rachata Ausavarungnirun, Kevin Kai-Wei Chang, Lavanya Subramanian, Gabriel H. Loh, and Onur Mutlu. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. In Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA '12, pages 416-427, Washington, DC, USA, 2012. IEEE Computer Society.

[9] Luiz André Barroso, Kourosh Gharachorloo, Robert McNamara, Andreas Nowatzyk, Shaz Qadeer, Barton Sano, Scott Smith, Robert Stets, and Ben Verghese. Piranha: a scalable architecture based on single-chip multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA '00, pages 282-293, New York, NY, USA, 2000. ACM.

[10] Bradford M. Beckmann, Michael R. Marty, and David A. Wood. ASR: Adaptive Selective Replication for CMP Caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pages 443-454, 2006.

[11] Bradford M. Beckmann and David A. Wood. Managing wire delay in large chip-multiprocessor caches. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 37, pages 319-330, Washington, DC, USA, 2004. IEEE Computer Society.

[12] Nathan Beckmann and Daniel Sanchez. Jigsaw: Scalable Software-defined Caches. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT '13, pages 213-224, Piscataway, NJ, USA, 2013. IEEE Press.

[13] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, Liewei Bao, J. Brown, M. Mattina, Chyi-Chang Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook. TILE64 processor: A 64-core SoC with mesh interconnect. In International Solid-State Circuits Conference, 2008.

[14] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In International Conference on Parallel Architectures and Compilation Techniques, 2008.

[15] Colin Blundell, Milo M. K. Martin, and Thomas F. Wenisch. InvisiFence: Performance-transparent memory ordering in conventional multiprocessors. In Proc. 36th Intl. Symp. on Computer Architecture, 2009.
[16] S. Borkar. Panel on State-of-the-art Electronics. NSF Workshop on Emerging Technologies for Interconnects, http://weti.cs.ohiou.edu/, 2012.

[17] Shekhar Borkar. Thousand core chips: a technology perspective. In Design Automation Conference, 2007.

[18] L. M. Censier and P. Feautrier. A new solution to coherence problems in multicache systems. IEEE Trans. Comput., 27(12):1112-1118, December 1978.

[19] Luis Ceze, James Tuck, Pablo Montesinos, and Josep Torrellas. BulkSC: Bulk Enforcement of Sequential Consistency. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, pages 278-289, New York, NY, USA, 2007. ACM.

[20] David Chaiken, Craig Fields, Kiyoshi Kurihara, and Anant Agarwal. Directory-based cache coherence in large-scale multiprocessors. Computer, 23(6):49-58, June 1990.

[21] David Chaiken, John Kubiatowicz, and Anant Agarwal. LimitLESS Directories: A Scalable Cache Coherence Scheme. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IV, pages 224-234, New York, NY, USA, 1991. ACM.

[22] Jichuan Chang and Gurindar S. Sohi. Cooperative Caching for Chip Multiprocessors. In ISCA, 2006.

[23] Zeshan Chishti, Michael D. Powell, and T. N. Vijaykumar. Distance associativity for high-performance energy-efficient non-uniform cache architectures. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 36, pages 55-, Washington, DC, USA, 2003. IEEE Computer Society.

[24] Zeshan Chishti, Michael D. Powell, and T. N. Vijaykumar. Optimizing replication, communication, and capacity allocation in CMPs. In Proceedings of the 32nd Annual International Symposium on Computer Architecture, ISCA '05, pages 357-368, 2005.

[25] Sangyeun Cho and Lei Jin. Managing distributed, shared L2 caches through OS-level page allocation. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pages 455-468, Washington, DC, USA, 2006. IEEE Computer Society.

[26] Pat Conway, Nathan Kalyanasundharam, Gregg Donley, Kevin Lepak, and Bill Hughes. Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor. IEEE Micro, 30(2), March 2010.

[27] William Dally and Brian Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.

[28] Abhishek Das, Matthew Schuchhardt, Nikos Hardavellas, Gokhan Memik, and Alok N. Choudhary. Dynamic directories: A mechanism for reducing on-chip interconnect power in multicores. In DATE, pages 479-484, 2012.

[29] Noel Eisley, Li-Shiuan Peh, and Li Shang. In-network cache coherence. In IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pages 321-332, 2006.

[30] M. Elver and V. Nagarajan. TSO-CC: Consistency directed cache coherence for TSO. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pages 165-176, February 2014.

[31] C. Fensch and M. Cintra. An OS-based alternative to full hardware coherence on tiled CMPs. In High Performance Computer Architecture, 2008. HPCA 2008. IEEE 14th International Symposium on, pages 355-366, 2008.

[32] Michael Ferdman, Pejman Lotfi-Kamran, Ken Balet, and Babak Falsafi. Cuckoo directory: A scalable directory for many-core systems. In 17th IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 169-180, 2011.

[33] Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. Two techniques to enhance the performance of memory consistency models. In Proceedings of the 1991 International Conference on Parallel Processing, pages 355-364, 1991.
[34] Chris Gniady, Babak Falsafi, and T. N. Vijaykumar. Is SC + ILP = RC? In Proceedings of the 26th Annual International Symposium on Computer Architecture, ISCA '99, pages 162-171, Washington, DC, USA, 1999. IEEE Computer Society.

[35] Antonio González, Carlos Aliagas, and Mateo Valero. A data cache with multiple caching strategies tuned to different types of locality. In Proceedings of the 9th International Conference on Supercomputing, ICS '95, pages 338-347, New York, NY, USA, 1995. ACM.

[36] Peter Greenhalgh. big.LITTLE processing with ARM Cortex-A15 & Cortex-A7. http://www.arm.com/files/downloads/bigLITTLEFinalFinal.pdf, 2011.

[37] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In International Conference on Parallel Processing, pages 312-321, 1990.

[38] P. Hammarlund, A.J. Martinez, A.A. Bajwa, D.L. Hill, E. Hallnor, Hong Jiang, M. Dixon, M. Derr, M. Hunsaker, R. Kumar, R.B. Osborne, R. Rajwar, R. Singhal, R. D'Sa, R. Chappell, S. Kaushik, S. Chennupaty, S. Jourdan, S. Gunther, T. Piazza, and T. Burton. Haswell: The fourth-generation Intel Core processor. Micro, IEEE, 34(2):6-20, March 2014.

[39] Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. Reactive NUCA: Near-optimal Block Placement and Replication in Distributed Caches. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, pages 184-195, New York, NY, USA, 2009. ACM.

[40] Enric Herrero, José González, and Ramon Canal. Elastic cooperative caching: an autonomous dynamically adaptive memory hierarchy for chip multiprocessors. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 419-428, New York, NY, USA, 2010. ACM.

[41] Henry Hoffmann, David Wentzlaff, and Anant Agarwal. Remote store programming: A memory model for embedded multicore. In Proceedings of the 5th International Conference on High Performance Embedded Architectures and Compilers, HiPEAC'10, pages 3-17, Berlin, Heidelberg, 2010. Springer-Verlag.

[42] Jaehyuk Huh, Changkyu Kim, Hazim Shafi, Lixin Zhang, Doug Burger, and Stephen W. Keckler. A NUCA substrate for flexible CMP cache sharing. In Proceedings of the 19th Annual International Conference on Supercomputing, ICS '05, pages 31-40, New York, NY, USA, 2005. ACM.

[43] S.M.Z. Iqbal, Yuchen Liang, and H. Grahn. ParMiBench - An Open-Source Benchmark for Embedded Multiprocessor Systems. Computer Architecture Letters, 9(2):45-48, February 2010.

[44] Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon C. Steely Jr., and Joel Emer. Achieving non-inclusive cache performance with inclusive caches: Temporal locality aware (TLA) cache management policies. In Int'l Symposium on Microarchitecture, 2010.

[45] Aamer Jaleel, William Hasenplaugh, Moinuddin Qureshi, Julien Sebot, Simon Steely, Jr., and Joel Emer. Adaptive insertion policies for managing shared caches. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT '08, pages 208-219, 2008.

[46] Aamer Jaleel, Kevin B. Theobald, Simon C. Steely, Jr., and Joel Emer. High Performance Cache Replacement Using Re-reference Interval Prediction (RRIP). In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 60-71, New York, NY, USA, 2010. ACM.
[47] Natalie Enright Jerger, Li-Shiuan Peh, and Mikko Lipasti. Virtual circuit tree multicasting: A case for on-chip hardware multicast support. In Int'l Symposium on Computer Architecture, 2008.

[48] Norman P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th Annual International Symposium on Computer Architecture, ISCA '90, pages 364-373, 1990.

[49] A. Khakifirooz, O.M. Nayfeh, and D. Antoniadis. A Simple Semiempirical Short-Channel MOSFET Current-Voltage Model Continuous Across All Regions of Operation and Employing Only Physical Parameters. Electron Devices, IEEE Transactions on, 56(8):1674-1680, August 2009.

[50] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS X, pages 211-222, New York, NY, USA, 2002. ACM.

[51] Tushar Krishna, Chia-Hsin Owen Chen, Sunghyun Park, Woo Cheol Kwon, Suvinay Subramanian, Anantha Chandrakasan, and Li-Shiuan Peh. Single-cycle multihop asynchronous repeated traversal: A smart future for reconfigurable on-chip networks. Computer, 46(10):48-55, October 2013.

[52] Amit Kumar, Partha Kundu, Arvind P. Singh, Li-Shiuan Peh, and Niraj K. Jha. A 4.6Tbits/s 3.6GHz single-cycle NoC router with a novel switch allocator in 65nm CMOS. In International Conference on Computer Design, 2007.

[53] G. Kurian, S. Devadas, and O. Khan. Locality-aware data replication in the last-level cache. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pages 1-12, February 2014.

[54] George Kurian, Omer Khan, and Srinivas Devadas. The locality-aware adaptive cache coherence protocol. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pages 523-534, New York, NY, USA, 2013. ACM.

[55] George Kurian, Omer Khan, and Srinivas Devadas. The locality-aware adaptive cache coherence protocol. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pages 523-534, New York, NY, USA, 2013. ACM.

[56] George Kurian, Jason E. Miller, James Psota, Jonathan Eastep, Jifeng Liu, Jurgen Michel, Lionel C. Kimerling, and Anant Agarwal. ATAC: A 1000-core Cache-coherent Processor with On-chip Optical Network. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pages 477-488, New York, NY, USA, 2010. ACM.

[57] George Kurian, Chen Sun, Chia-Hsin Owen Chen, Jason E. Miller, Jurgen Michel, Lan Wei, Dimitri A. Antoniadis, Li-Shiuan Peh, Lionel Kimerling, Vladimir Stojanovic, and Anant Agarwal. Cross-layer energy and performance evaluation of a nanophotonic manycore processor system using real application workloads. Parallel and Distributed Processing Symposium, 0:1117-1130, 2012.

[58] An-Chow Lai, Cem Fide, and Babak Falsafi. Dead-block prediction & dead-block correlating prefetchers. In Proceedings of the 28th Annual International Symposium on Computer Architecture, ISCA '01, pages 144-154, New York, NY, USA, 2001. ACM.

[59] L. Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Trans. Comput., 28(9):690-691, September 1979.
[60] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In International Symposium on Microarchitecture, 2009.

[61] Haiming Liu, Michael Ferdman, Jaehyuk Huh, and Doug Burger. Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency. In Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 41, pages 222-233, 2008.

[62] G.H. Loh. 3D-stacked memory architectures for multi-core processors. In Computer Architecture, 2008. ISCA '08. 35th International Symposium on, pages 453-464, June 2008.

[63] Meredydd Luff and Simon Moore. Asynchronous remote stores for interprocessor communication. In Future Architectural Support for Parallel Programming (FASPP), 2012.

[64] Yeong-Chang Maa, Dhiraj K. Pradhan, and Dominique Thiebaut. Two economical directory schemes for large-scale cache coherent multiprocessors. SIGARCH Comput. Archit. News, 19(5):10-, September 1991.

[65] Sela Mador-Haim, Luc Maranget, Susmit Sarkar, Kayvan Memarian, Jade Alglave, Scott Owens, Rajeev Alur, Milo M. K. Martin, Peter Sewell, and Derek Williams. An axiomatic memory model for POWER multiprocessors. In Proceedings of the 24th International Conference on Computer Aided Verification, CAV'12, pages 495-512, Berlin, Heidelberg, 2012. Springer-Verlag.

[66] Milo M. K. Martin, Mark D. Hill, and Daniel J. Sorin. Why on-chip cache coherence is here to stay. Commun. ACM, 55(7):78-89, 2012.

[67] Scott McFarling. Cache replacement with dynamic exclusion. In Proceedings of the 19th Annual International Symposium on Computer Architecture, ISCA '92, pages 191-200, New York, NY, USA, 1992. ACM.

[68] Jason E. Miller, Harshad Kasture, George Kurian, Charles Gruenwald III, Nathan Beckmann, Christopher Celio, Jonathan Eastep, and Anant Agarwal. Graphite: A Distributed Parallel Simulator for Multicores. In International Symposium on High-Performance Computer Architecture, 2010.

[69] Sparsh Mittal, Jeffrey S. Vetter, and Dong Li. Improving energy efficiency of embedded DRAM caches for high-end computing systems. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC '14, pages 99-110, New York, NY, USA, 2014. ACM.

[70] Brian W. O'Krafka and A. Richard Newton. An empirical evaluation of two memory-efficient directory methods. In Proceedings of the 17th Annual International Symposium on Computer Architecture, ISCA '90, pages 138-147, New York, NY, USA, 1990. ACM.

[71] Jongsoo Park, Richard M. Yoo, Daya S. Khudia, Christopher J. Hughes, and Daehyun Kim. Location-aware cache management for many-core processors with deep cache hierarchy. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pages 20:1-20:12, New York, NY, USA, 2013. ACM.

[72] Dadi Perlmutter. Introducing next generation low power microarchitecture: Silvermont. http://files.shareholder.com/downloads/INTC/0x0x660894/f3398730-60e7-44bc-a92f-9a9a652f11c9/2013_SilvermontFINAL-Mon_Harbor.pdf, 2013.

[73] Manoj Plakal, Daniel J. Sorin, Anne E. Condon, and Mark D. Hill. Lamport clocks: Verifying a directory cache-coherence protocol. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pages 67-76, New York, NY, USA, 1998. ACM.

[74] M.K. Qureshi. Adaptive spill-receive for robust high-performance caching in CMPs. In High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th International Symposium on, pages 45-54, February 2009.
[75] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely, and Joel Emer. Adaptive Insertion Policies for High Performance Caching. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, pages 381-391, New York, NY, USA, 2007. ACM.

[76] Moinuddin K. Qureshi and Yale N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pages 423-432, Washington, DC, USA, 2006. IEEE Computer Society.

[77] Dyer Rolan, Basilio B. Fraguela, and Ramon Doallo. Adaptive set-granular cooperative caching. In Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture, HPCA '12, pages 1-12, Washington, DC, USA, 2012. IEEE Computer Society.

[78] Daniel Sanchez and Christos Kozyrakis. The ZCache: Decoupling Ways and Associativity. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '43, pages 187-198, Washington, DC, USA, 2010. IEEE Computer Society.

[79] Daniel Sanchez and Christos Kozyrakis. Vantage: scalable and efficient fine-grain cache partitioning. In ISCA, pages 57-68, 2011.

[80] Daniel Sanchez and Christos Kozyrakis. SCD: A Scalable Coherence Directory with Flexible Sharer Set Encoding. In International Symposium on High-Performance Computer Architecture, 2012.

[81] Daniel Sanchez and Christos Kozyrakis. ZSim: Fast and accurate microarchitectural simulation of thousand-core systems. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pages 475-486, New York, NY, USA, 2013. ACM.

[82] Daniel Sanchez, George Michelogiannakis, and Christos Kozyrakis. An Analysis of On-chip Interconnection Networks for Large-scale Chip Multiprocessors. ACM Trans. Archit. Code Optim., 7(1):4:1-4:28, May 2010.

[83] Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. x86-TSO: A rigorous and usable programmer's model for x86 multiprocessors. Commun. ACM, 53(7):89-97, July 2010.

[84] Abhayendra Singh, Satish Narayanasamy, Daniel Marino, Todd Millstein, and Madanlal Musuvathi. End-to-end sequential consistency. In Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA '12, pages 524-535, Washington, DC, USA, 2012. IEEE Computer Society.

[85] Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O'Connor, and Tor M. Aamodt. Cache coherence for GPU architectures. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), HPCA '13, pages 578-590, Washington, DC, USA, 2013. IEEE Computer Society.

[86] Daniel J. Sorin, Mark D. Hill, and David A. Wood. A Primer on Memory Consistency and Cache Coherence. Synthesis Lectures in Computer Architecture, Morgan Claypool Publishers, 2011.

[87] Chen Sun, Chia-Hsin Owen Chen, George Kurian, Lan Wei, Jason Miller, Anant Agarwal, Li-Shiuan Peh, and Vladimir Stojanovic. DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling. In International Symposium on Networks-on-Chip, 2012.

[88] Sun Microsystems. UltraSPARC T2 supplement to the UltraSPARC architecture 2007. Technical Report, 2007.
[89] Gary Tyson, Matthew Farrens, John Matthews, and Andrew R. Pleszkun. A modified approach to data cache management. In International Symposium on Microarchitecture, pages 93-103, 1995.

[90] Lan Wei, F. Boeuf, T. Skotnicki, and H.-S.P. Wong. Parasitic Capacitances: Analytical Models and Impact on Circuit-Level Performance. Electron Devices, IEEE Transactions on, 58(5):1361-1370, May 2011.

[91] Thomas F. Wenisch, Anastasia Ailamaki, Babak Falsafi, and Andreas Moshovos. Mechanisms for store-wait-free multiprocessors. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, pages 266-277, New York, NY, USA, 2007. ACM.

[92] David Wentzlaff, Patrick Griffin, Henry Hoffmann, Liewei Bao, Bruce Edwards, Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John F. Brown III, and Anant Agarwal. On-chip interconnection architecture of the Tile processor. IEEE Micro, 27(5):15-31, 2007.

[93] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In International Symposium on Computer Architecture, 1995.

[94] K.C. Yeager. The MIPS R10000 superscalar microprocessor. Micro, IEEE, 16(2):28-41, April 1996.

[95] T. Yemliha, S. Srikantaiah, M. Kandemir, M. Karakoy, and M.J. Irwin. Integrated code and data placement in two-dimensional mesh based chip multiprocessors. In Computer-Aided Design, 2008. ICCAD 2008. IEEE/ACM International Conference on, pages 583-588, November 2008.

[96] Jason Zebchuk, Vijayalakshmi Srinivasan, Moinuddin K. Qureshi, and Andreas Moshovos. A tagless coherence directory. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 423-434, New York, NY, USA, 2009. ACM.

[97] Michael Zhang and Krste Asanovic. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. In Proceedings of the 32nd Annual International Symposium on Computer Architecture, ISCA '05, pages 336-345, Washington, DC, USA, 2005. IEEE Computer Society.

[98] Yuanrui Zhang, Wei Ding, Mahmut Kandemir, Jun Liu, and Ohyoung Jang. A Data Layout Optimization Framework for NUCA-based Multicores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44 '11, pages 489-500, New York, NY, USA, 2011. ACM.

[99] Hongzhou Zhao, Arrvindh Shriraman, and Sandhya Dwarkadas. SPACE: sharing pattern-based directory coherence for multicore scalability. In International Conference on Parallel Architectures and Compilation Techniques, pages 135-146, 2010.

[100] Hongzhou Zhao, Arrvindh Shriraman, Sandhya Dwarkadas, and Vijayalakshmi Srinivasan. SPATL: Honey, I Shrunk the Coherence Directory. In International Conference on Parallel Architectures and Compilation Techniques, 2011.