Directoryless Shared Memory Architecture using Thread Migration and Remote Access

by Keun Sup Shim

Bachelor of Science, Electrical Engineering and Computer Science, KAIST, 2006
Master of Science, Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2010

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology, June 2014.

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 2014 (signature redacted)
Certified by: Srinivas Devadas, Edwin Sibley Webster Professor, Thesis Supervisor (signature redacted)
Accepted by: Leslie A. Kolodziejski, Chair, Department Committee on Graduate Students (signature redacted)

Directoryless Shared Memory Architecture using Thread Migration and Remote Access

by Keun Sup Shim

Submitted to the Department of Electrical Engineering and Computer Science on May 14, 2014, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

Chip multiprocessors (CMPs) have become mainstream in recent years, and, for scalability reasons, high-core-count designs tend towards tiled CMPs with physically distributed caches. In order to support shared memory, current many-core CMPs maintain cache coherence using distributed directory protocols, which are extremely difficult and error-prone to implement and verify. Private caches with directory-based coherence also provide suboptimal performance when a thread accesses large amounts of data distributed across the chip: the data must be brought to the core where the thread is running, incurring delays and energy costs. Under this scenario, migrating a thread to data instead of the other way around can improve performance. In this thesis, we propose a directoryless approach where data can be accessed either via a round-trip remote access protocol or by migrating a thread to where data resides. While our hardware mechanism for fine-grained thread migration enables faster migration than previous proposals, its costs still make it crucial to use thread migrations judiciously for the performance of our proposed architecture. We, therefore, present an on-line algorithm which decides at the instruction level whether to perform a remote access or a thread migration. In addition, to further reduce migration costs, we extend our scheme to support partial context migration by predicting the necessary thread context. Finally, we provide the ASIC implementation details as well as RTL simulation results of the Execution Migration Machine (EM²), a 110-core directoryless shared-memory processor.

Thesis Supervisor: Srinivas Devadas
Title: Edwin Sibley Webster Professor

Acknowledgments

First and foremost, I would like to express my deepest gratitude to my advisor, Professor Srinivas Devadas, who has offered me full support and has been a tremendous mentor throughout my Ph.D. years. I feel very fortunate to have had the opportunity to work with him and learn from him. His energy and insight will continue to inspire me throughout my career. I would also like to thank my committee members Professor Arvind and Professor Daniel Sanchez.
They both provided me with invaluable feedback and advice that helped me to develop my thesis more thoroughly. I am especially grateful to Arvind for being accessible as a counselor as well, and to Daniel for always being an inspiration to me for his passion in this field. I truly thank another mentor of mine, Professor Joel Emer. From Joel, I learned not only about the core concepts of computer architecture but also about teaching. I feel very privileged for having been a teaching assistant for his class. While I appreciate all of my fellow students in the Computation Structures Group at MIT, I want to express special thanks to Mieszko Lis and Myong Hyon Cho. We were great collaborators on the EM² tapeout project, and at the same time, awesome friends during our doctoral years. It was a great pleasure for me to work with such talented and fun people. Getting through my dissertation required more than academic support. Words cannot express my gratitude and appreciation to my friends from Seoul Science High School and KAIST at MIT. I am also grateful to my friends at Boston Onnuri Church for their prayers and encouragement. I would also like to extend my deep gratitude to the Samsung Scholarship for supporting me financially during my doctoral study. My fiancée Song-Hee deserves my special thanks for her love and care. She has believed in me more than I did myself, and her consistent support has always kept me energized and made me feel that I am never alone. I cannot thank my parents and family enough; they have always believed in me, and have been behind me throughout my entire life. Lastly, I thank God, for offering me so many opportunities in my life and giving me the strength and wisdom to fully enjoy them.

Contents

1 Introduction
  1.1 Large-Scale Chip Multiprocessors
  1.2 Shared Memory for Large-Scale CMPs
  1.3 Motivation for Fine-grained Thread Migration
  1.4 Motivation for Directoryless Architecture
  1.5 Previous Works on Thread Migration
  1.6 Contributions

2 Directoryless Architecture
  2.1 Introduction
  2.2 Remote Cache Access
  2.3 Hardware-level Thread Migration
  2.4 Performance Overhead of Thread Migration
  2.5 Hybrid Memory Access Framework

3 Thread Migration Prediction
  3.1 Introduction
  3.2 Thread Migration Predictor
    3.2.1 Per-core Thread Migration Predictor
    3.2.2 Detecting Migratory Instructions: WHEN to migrate
    3.2.3 Possible Thrashing in the Migration Predictor
  3.3 Experimental Setup
    3.3.1 Application Benchmarks
    3.3.2 Evaluated Systems
  3.4 Simulation Results
    3.4.1 Performance
  3.5 Chapter Summary

4 Partial Context Migration for General Register File Architecture
  4.1 Introduction
  4.2 Partial Context Thread Migration
    4.2.1 Extending Migration Predictor
    4.2.2 Detection of Useful Registers: WHAT to migrate
    4.2.3 Partial Context Migration Policy
    4.2.4 Misprediction handling
  4.3 Experimental Setup
    4.3.1 Evaluated Systems
  4.4 Simulation Results
    4.4.1 Performance and Network Traffic
    4.4.2 The Effects of Network Parameters
  4.5 Chapter Summary

5 The EM² silicon implementation
  5.1 Introduction
  5.2 EM² Processor
    5.2.1 System architecture
    5.2.2 Tile architecture
    5.2.3 Stack-based core architecture
    5.2.4 Thread migration implementation
    5.2.5 The instruction set
    5.2.6 System configuration and bootstrap
    5.2.7 Virtual memory and OS implications
  5.3 Migration Predictor for EM²
    5.3.1 Stack-based Architecture variant
    5.3.2 Partial Context Migration Policy
    5.3.3 Implementation Details
  5.4 Physical Design of the EM² Processor
    5.4.1 Overview
    5.4.2 Tile-level
    5.4.3 Chip-level
  5.5 Evaluation Methods
    5.5.1 RTL simulation
    5.5.2 Area and power estimates
  5.6 Evaluation
    5.6.1 Performance tradeoff factors
    5.6.2 Benchmark performance
    5.6.3 Area and power costs
    5.6.4 Verification Complexity
  5.7 Chapter Summary

6 Conclusions
  6.1 Thesis contributions
  6.2 Architectural assumptions and their implications
  6.3 Future avenues of research

Bibliography

A Source-level Read-only Data Replication

List of Figures

1-1 Rationale of moving computation instead of data
2-1 Hardware-level thread migration via the on-chip interconnect
2-2 Hybrid memory access framework for our directoryless architecture
3-1 Hybrid memory access architecture with a thread migration predictor on a 5-stage pipeline core
3-2 An example of how instructions (or PCs) that are followed by consecutive accesses to the same home location (i.e., migratory instructions) are detected in the case of the depth threshold θ = 2
3-3 An example of how the decision between remote access and thread migration is made for every memory access
3-4 Parallel K-fold cross-validation using perceptron
3-5 Core miss rate and its breakdown into remote access rate and migration rate
3-6 Parallel completion time normalized to the remote-access-only architecture (NoDirRA)
3-7 Network traffic normalized to the remote-access-only architecture (NoDirRA)
4-1 Hardware-level thread migration with partial context migration support
4-2 A per-core PC-based migration predictor, where each entry contains a {PC, register mask} pair
4-3 An example of how registers being read/written are kept track of and how the information is inserted into the migration predictor when a specific instruction (or PC) is detected as a migratory instruction (the depth threshold θ = 2)
4-4 An example of a partial context thread migration
4-5 Parallel completion time normalized to DirCC
4-6 Network traffic normalized to DirCC
4-7 Breakdown of L1 miss rate
4-8 Core miss rate for directoryless systems
4-9 Network traffic breakdown
4-10 Breakdown of migrated context into used and unused registers
4-11 The effect of network latency and bandwidth on performance and network traffic
5-1 Chip-level layout of the 110-core EM² chip
5-2 EM² Tile Architecture
5-3 The stack-based processor core diagram of EM²
5-4 Hardware-level thread migration via the on-chip interconnect under EM². Only the main stack is shown for simplicity
5-5 The two-stage scan chain used to configure the EM² chip
5-6 Integration of a PC-based migration predictor into a stack-based, two-stage pipelined core of EM²
5-7 Decision/Learning mechanism of the migration predictor
5-8 EM² Tile Layout
5-9 Die photo of the 110-core EM² chip
5-10 Thread migration (EM²) vs Remote access (RA)
5-11 Thread migration (EM²) vs Private caching (CC)
5-12 The effect of distance on RA, CC and EM²
5-13 The evaluation of EM²
5-14 Thread migration statistics under EM²
5-15 Performance and network traffic with different number of threads for tbscan under EM²
88 5-16 N instructions before being evicted from a guest context under EM 2 . 88 5-17 EM 2 allows efficient bulk loads from a remote core. 90 . . . . . . . . . . 5-18 Relative area and leakage power costs of EM 2 vs. estimates for exactsharer CC with the directory sized to 100% and 50% of the D$ entries (DC Ultra, IBM 45nm SOI hvt library, 800MHz). 5-19 Bottom-up verification methodology of EM 2 13 . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . 92 93 14 List of Tables 3.1 System configurations used . . . . . . . . . . . . . . . . . . . . . . . . 39 5.1 Interface ports of the migration predictor in EM 2 . . . . . . . . . . . . 75 5.2 Power estimates of the EM 2 tile (reported by Design Compiler) 78 5.3 A summary of architectural costs that differ in the EM 2 and CC implemen- . . . tation s. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A.1 The total number of changed code lines . . . . . . . . . . . . . . . . . 15 91 110 16 Chapter 1 Introduction 1.1 Large-Scale Chip Multiprocessors For the past decades, CMOS scaling has been a driving force of computer performance improvements. The number of transistors on a single chip has doubled roughly every 18 months (known as Moore's law [44]), and along with Dennard scaling [21], we could improve the processor performance without hitting the power wall [21]. Starting from the mid-2000's, however, supply voltage scaling has stopped due to higher leakage, and power limits have halted the drive to higher core frequencies. Unlike the end of Dennard scaling, transistor density has continued to grow [25]. As increasing instruction-level parallelism (ILP) of a single-core processor became less efficient, computer architects have turned to multicore architectures rather than more complex uniprocessor architectures to better utilize the available transistors for overall performance. And since 2005 when we have had dual-core processors in the market, Chip Multiprocessors (CMPs) with more than one core on a single chip have already become common in the commodity and general-purpose processor markets [50,56]. To further improve performance, architects are now resorting to medium and largescale multicores. In addition to multiprocessor projects in academia (e.g., RAW [58], TRIPS [52]), Intel demonstrated its 80-tile TeraFLOPS research chip in 65-nm CMOS in 2008 [57], followed by the 48-core SCC processor in 45-nm technology, the second processor in the TeraScale Research program [31]. In 2012, Intel introduced its first 17 Many Integrated Core (MIC) product which has over 60 cores to the market as the Intel Xeon Phi family [29], and it has recently announced a 72-core x86 Knights Landing CPU [30]. Tilera Corporation has shipped its first multiprocessor, TILE64 [7, 59], which connects 64 tiles with 2-D mesh networks, in 2007; the company has further announced TILE-Gx72 which implements 72 power-efficient processor cores and is suited for many compute and I/O-intensive applications [17]. Adapteva also announced its 64-core 28-nm microprocessor based on its Epiphany architecture which supports shared memory and uses a 2D mesh network [48]. As seen by many examples, processor manufacturers are already able to place tens and hundreds of cores on a single chip, and industry pundits are predicting 1000 or more cores in a few years [2,8,61]. 
1.2 Shared Memory for Large-Scale CMPs For manycore CMPs, each core typically has per-core Li and L2 caches since power requirements of caches grow quadratically with size; therefore, the only practical option to implement a large on-chip cache is to physically distribute cache on the chip so that every core is near some portion of the cache [7,29]. And since conventional bus and crossbar interconnects no longer scale due to the bandwidth and area limitations [45,46], these cores are often connected via an on-chip interconnect, forming a tiled architecture (e.g., Raw [58], TRIPS [52], Tilera [7], Intel TeraFLOPS [57], Adapteva [48]). How will these manycore chips be programmed? Programming convenience provided by the shared memory abstraction has made it the most popular paradigm for general-purpose parallel programming. While architectures with restricted memory models (most notably GPUs) have enjoyed immense success in specific applications (such as rendering graphics), most programmers prefer a shared memory model [55], and commercial general-purpose multicores have supported this abstraction in hardware. The main question, then, is how to efficiently provide coherent shared memory on the scale of hundreds or thousands of cores. Providing a full shared-memory abstraction requires cache coherence, which is 18 traditionally implemented by bus-based snooping or a centralized directory for CMPs with relatively few cores. For large-scale CMPs where bus-based mechanisms fail, however, snooping and centralized directories are no longer viable, and such many-core systems commonly provide cache coherence via distributed directory protocols. A logically central but physically distributed directory coordinates sharing among the per-core caches, and each core cache must negotiate shared (read-only) or exclusive (read/write) access to each cache line via a coherence protocol. The use of directories poses its own challenges, however. Coherence traffic can be significant, which increases interconnect power, delay, and congestion; the performance of applications can suffer due to long latency between directories and requestors especially, for shared read/write data; finally, directory sizes must equal a significant portion of the combined size of the per-core caches, as otherwise directory evictions will limit performance [27]. Although some recent works propose more scalable directories or coherence protocols in terms of area and performance [16,18, 20,24,51], the scalability of directories to a large number of cores still remains an arguably critical challenge due to the design complexity, area overheads, etc. 1.3 Motivation for Fine-grained Thread Migration Under tiled CMPs, each core has its own cache slice and the last-level cache can be implemented either as private or shared; while the trade-offs between the two have been actively explored [12,62], many recent works have organized physically distributed L2 cache slices to form one logically shared L2 cache, naturally leading to a Non-Uniform Cache Access (NUCA) architecture [4,6,13,15,28,33,36]. And when large data structures that do not fit in a single cache are shared by multiple threads or iteratively accessed even by a single thread, the data are typically distributed across these multiple shared cache slices to minimize expensive off-chip accesses. 
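To make this distribution concrete, the short sketch below shows one simple way an address can be mapped to the shared L2 slice (home core) that may cache it, by striping blocks across cores using a few address bits. The core count, page granularity, and mapping function are assumptions for illustration only, not the specific placement policy used later in this thesis.

```python
# Illustrative only: map an address to the "home" L2 slice allowed to cache it.
# Assumed parameters (not this thesis's exact configuration): 64 cores, 4 KB pages.
NUM_CORES = 64
PAGE_BITS = 12          # 4 KB granularity; a line-granularity scheme would use ~6 bits

def home_core(addr: int) -> int:
    """Return the core whose L2 slice is allowed to cache this address."""
    return (addr >> PAGE_BITS) % NUM_CORES   # stripe pages round-robin across slices

# A thread that touches a structure spanning many pages will find its data spread
# over many different home cores:
addrs = [0x10000 + i * 0x1000 for i in range(8)]
print([home_core(a) for a in addrs])   # eight consecutive pages -> eight home cores
```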
This raises the need for a thread to access data mapped at remote caches often with high spatio-temporal locality, which is prevalent in many applications; for example, a database request might result in a series of phases, each consisting of many accesses 19 to contiguous stretches of data. Chunk 1 Chunk 4Chunk3Chk4 (a) Directory-based / RA-only (b) Thread migration Figure 1-1: Rationale of moving computation instead of data In a manycore architecture without efficient thread migration, this pattern results in large amounts of on-chip network traffic. Each request will typically run in a separate thread, pinned to a single core throughout its execution. Because this thread might access data cached in last-level cache slices located in different tiles, the data must be brought to the core where the thread is running. For example, in a directorybased architecture, the data would be brought to the core's private cache, only to be replaced when the next phase of the request accesses a different segment of data (see Figure 1-1a). If threads can be efficiently migrated across the chip, however, the on-chip data movement-and with it, energy use-can be significantly reduced; instead of transferring data to feed the computing thread, the thread itself can migrate to follow the data. When applications exhibit data access locality, efficient thread migration can turn many round-trips to retrieve data into a series of migrations followed by long stretches of accesses to locally cached data (see Figure 1-1b). And if the thread context is small compared to the data that would otherwise be transferred, moving the thread can be a huge win. Migration latency also needs to be kept reasonably low, and we argue that these requirements call for a simple, efficient hardware-level implementation of thread migration at the architecture level. 20 1.4 Motivation for Directoryless Architecture As described in Chapter 1.2, private Li caches need to maintain cache coherence to support shared memory, which is commonly done via distributed directory-based protocols in modern large-scale CMPs. One barrier to distributed directory coherence protocols, however, is that they are extremely difficult to implement and verify [35]. The design of even a simple coherence protocol is not trivial; under a coherence protocol, the response to a given request is determined by the state of all actors in the system, transient states due to indirections (e.g., cache line invalidation), and transient states due to the nondeterminism inherent in the relative timing of events. Since the state space explodes exponentially as the distributed directories and the number of cores grow, it is virtually impossible to cover all scenarios during verification either by simulation or by formal methods [63]. Unfortunately, verifying small subsystems does not guarantee the correctness of the entire system [3]. In modern CMPs, errors in cache coherence are one of the leading bug sources in the post-silicon debugging phase [22]. A straightforward approach to removing directories while maintaining cache coherence is to disallow cache line replication across on-chip caches (even Li caches) and use remote word-level access to load and store remotely cached data [23]: in this scheme, every access to an address cached on a remote core becomes a two-message round trip. Since only one copy is ever cached, coherence is trivially ensured. 
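As a behavioral sketch of the remote-access idea just described (not the actual hardware interface), each core-miss load or store can be modeled as a word-granularity request/response pair, with the returned word never installed in the requesting core's cache. All names and the page-striped mapping below are illustrative assumptions.

```python
# Minimal behavioral model of word-granularity remote access (illustrative, not RTL).
# The word is never cached at the requesting core, so only one copy ever exists and
# nothing needs to be kept coherent; the price is one request/response per core miss.
NUM_CORES = 64
home_core = lambda addr: (addr >> 12) % NUM_CORES    # assumed page-striped mapping

caches = {c: {} for c in range(NUM_CORES)}           # per-core cache slice (addr -> word)
messages_sent = 0

def remote_load(requesting_core, addr):
    global messages_sent
    home = home_core(addr)
    if home != requesting_core:                       # core miss: round trip over the NoC
        messages_sent += 2                            # request + data-word response
    return caches[home].get(addr, 0)

def remote_store(requesting_core, addr, value):
    global messages_sent
    home = home_core(addr)
    if home != requesting_core:
        messages_sent += 2                            # request + acknowledgement
    caches[home][addr] = value

remote_store(0, 0x5000, 7)                    # core 0 writes a word homed at core 5
print(remote_load(0, 0x5000), messages_sent)  # -> 7 4 (two round trips, four messages)
```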
Such a remote-access-only architecture, however, is still susceptible to data access patterns as shown in Figure 1-la; each request to non-local data would result in a request-response pair sent across the on-chip interconnect, incurring significant network traffic and performance degradation. As a new design point, therefore, we propose a directoryless architecture which better exploits data locality by using fine-grained hardware-level thread migration to complement remote accesses [14,41]. In this approach, accesses to data cached at a remote core can also cause the thread to migrate to that core and continue execution there. When several consecutive accesses are made to data at the same core, thread 21 migration allows those accesses to become local, potentially improving performance over a remote-access regimen. Migration costs, however, make it crucial to migrate only when multiple remote accesses would be replaced to make the cost "worth it." Moreover, since only a few registers are typically used between the time the thread migrates out and returns, transfer costs can be reduced by not migrating the unused registers. In this thesis, we especially focus on how to make judicious decisions on whether to perform a remote access or to migrate a thread, and how to further reduce thread migration costs by only migrating the necessary thread context. 1.5 Previous Works on Thread Migration Migrating computation to accelerate data access is not itself a novel idea. Hector Garcia-Molina in 1984 introduced the idea of moving processing to data in memory bound architectures [26], and improving memory access latency via migration has been proposed using coarse-grained compiler transformations [32]. In recent years migrating execution context has re-emerged in the context of single-chip multicores. Michaud showed that execution migration can improve the overall on-chip cache capacity and selectively migrated sequential programs to improve cache performance [42]. Computation spreading [11] splits thread code into segments and migrates threads among cores assigned to the segments to improve code locality. In the area of reliability, Core salvaging [47] allows programs to run on cores with permanent hardware faults provided they can migrate to access the locally damaged module at a remote core. In design-for-power, Thread motion [49] migrates less demanding threads to cores in a lower voltage/frequency domain to improve the overall power/performance ratios. More recently, thread migration among heterogeneous cores has been proposed to improve program bottlenecks (e.g., locks) [34]. Moving thread execution from one processor to another has long been a common feature in operating systems. The 02 scheduler [9], for example, improves memory performance in distributed-memory multicores by trying to keep threads near their 22 data during OS scheduling. This OS-mediated form of migration, however, is far too slow to make migrating threads for more efficient cache access viable: just moving the thread takes many hundreds of cycles at best (indeed, OSes generally avoid rebalancing processor core queues when possible). In addition, commodity processors are simply not designed to support migration efficiently: while context switch time is a design consideration, the very coarse granularity of OS-driven thread movement means that optimizing for fast migration is not. Similarly, existing descriptions of hardware-level thread migration do not focus primarily on fast, efficient migrations. 
Thread Motion [49], for example, uses special microinstructions to write the thread context to the cache and leverages the underlying MESI coherence protocol to move threads via the last-level cache. The considerable onchip traffic and delays that result when the coherence protocol contacts the directory, invalidates sharers, and moves the cache line, is acceptable for the 1000-cycle granularity of the centralized thread balancing logic, but not for the fine-grained migration at the instruction level which is the focus of this thesis. Similarly, hardware-level migration among cores via a single, centrally scheduled pool of inactive threads has been described in a four-core CMP [10]; designed to hide off-chip DRAM access latency, this design did not focus on migration efficiency, and, together with the round-trips required for thread-swap requests, the indirections via a per-core spill/fill buffer and the central inactive pool make it inadequate for the fine-grained migration needed to access remote caches. 1.6 Contributions The specific contributions of this dissertation are as follows: 1. A directoryless architecture which supports fine-grained hardwarelevel thread migration to complement remote accesses (Chapter 2). Although thread (or process) movement has long been a common OS feature, the millisecond granularity makes this technique unsuitable for taking advantage of shorter-lived phenomena like fine-grained memory access locality. Based 23 on our pure hardware implementation of thread migration, we introduce a directoryless architecture where data mapped on a remote core can be accessed via a round-trip remote access protocol or by migrating a thread to where data resides. 2. A novel migration prediction mechanism which decides at instruction granularity whether to perform a remote access or a thread migration (Chapter 3). Due to high migration costs, it is crucial to use thread migrations judiciously under the proposed directoryless architecture. We, therefore, present an on-line algorithm which decides at the instruction level whether to perform a remote access or a thread migration. 3. Partial context thread migration to reduce migration costs (Chapter 4). We observe that not all the architectural registers are used while a thread is running on the migrated core, and therefore, always moving the entire thread context upon thread migrations is wasteful. In order to further cut down the cost of thread migration, we extend our prediction scheme to support partial context migration, a novel thread migration approach that only migrates the necessary part of the architectural state. 4. The 110-core Execution Migration Machine (EM 2 )-the silicon implementation to support hardware-level thread migration in a 45nm ASIC (Chapter 5). We provide the salient physical implementation details of our silicon prototype of the proposed architecture built as a 110-core CMP, which occupies 100mm 2 in 45nm ASIC technology. The EM 2 chip adopts the stack-based core architecture which is best suited for partial context migration, and it also implements the stack-variant migration predictor. We also present detailed evaluation results of EM 2 using the RTL-level simulation of several benchmarks on a full 110-core chip. Chapter 6 concludes the thesis with a summary of the major findings and suggestions for future avenues of research. 24 Relation to other publications. This thesis extends and summarizes prior publi- cations by the author and others. 
The deadlock-free fine-grained thread migration protocol was first presented in [14], and a directoryless architecture using this thread migration framework with remote access (cf. Chapter 2) was introduced in [40,41]. While these papers do not address deciding between migrations and remote accesses for each memory access, Chapter 3 subsumes the description of a migration predictor presented in [54]. The work is extended in Chapter 4 to support partial context migration by learning and predicting the necessary thread context. In terms of the EM 2 chip, the tapeout process was in collaboration with Mieszko Lis and Myong Hyon Cho, and the evaluation results of the RTL simulation in Chapter 5 were joint with Mieszko Lis; some of these contents, therefore, will also appear or has appeared in their theses. The physical implementation details of EM 2 and our chip design experience can also be found in [53]. 25 26 Chapter 2 Directoryless Architecture 2.1 Introduction For scalability reasons, large-scale CMPs (> 16 cores) tend towards a tiled architecture where arrays of replicated tiles are connected over an on-chip interconnect [7,52,58]. Each tile contains a processor with its own Li cache, a slice of the L2 cache, and a router that connects to the on-chip network. To maximize effective on-chip cache capacity and reduce off-chip access rates, physically distributed L2 cache slices form one large logically shared cache, known as Non-Uniform Cache Access (NUCA) architecture [13,28,36]. Under this Shared L2 organization of NUCA designs, the address space is divided among the cores in such a way that each address is assigned to a unique home core where the data corresponding to the address can be cached at the L2 level. At the Li level, on the other hand, data can be replicated across any requesting core since current CMPs use Private Li caches. Coherence at the Li level is maintained via a coherence protocol and distributed directories, which are commonly co-located with the shared L2 slice at the home core. To completely obviate the need for complex protocols and directories, a directoryless architecture extends the shared organization to Li caches-a cache line may only reside in its home core even at the Li level [23]. Because only one copy is ever cached, cache coherence is trivially ensured. To read and write data cached in a remote core, the directoryless architectures proposed and built so far use a remote access 27 mechanism wherein a request is sent to the home core and the resulting data (or acknowledgement) is sent back to the requesting core. In what follows, we describe this remote access protocol, as well as a protocol based on hardware-level thread migration where instead of making a round-trip remote access the thread simply moves to the core where the data resides. We then present a framework that combines both. 2.2 Remote Cache Access Under the remote-access framework of directoryless designs [23, 36], all non-local memory accesses cause a request to be transmitted over the interconnect network, the access to be performed in the remote core, and the data (for loads) or acknowledgement (for writes) to be sent back to the requesting core: when a core C executes a memory access for address A, it must 1. find the home core H for A (e.g., by consulting a mapping table or masking some address bits); 2. if H = C (a core hit), (a) forward the request for A to the cache hierarchy (possibly resulting in a DRAM access); 3. 
if H # C (a core miss), (a) send a remote access request for address A to core H; (b) when the request arrives at H, forward it to H's cache hierarchy (possibly resulting in a DRAM access); (c) when the cache access completes, send a response back to C; (d) once the response arrives at C, continue execution. Note that, unlike a private cache organization where a coherence protocol (e.g., directory-based protocol) takes advantage of spatial and temporal locality by making 28 a copy of the block containing the data in the local cache, this protocol incurs a round-trip access for every remote word. Each load or store access to an address cached in a different core incurs a word-granularity round-trip message to the core allowed to cache the address, and the retrieved data is never cached locally (the combination of word-level access and no local caching ensures correct memory semantics). 2.3 Hardware-level Thread Migration We now describe fine-grained, hardware-level thread migration, which we use to better exploit data locality for our directoryless architecture. This mechanism brings the execution to the locus of the data instead of the other way around: when a thread needs access to an address cached on another core, the hardware efficiently migrates the thread's execution context to the core where the data is (or is allowed to be) cached. If a thread is already executing at the destination core, it must be evicted and moved to a core where it can continue running. To reduce the need for evictions and amortize migration latency, cores duplicate the architectural context (register file, etc.) and allow a core to multiplex execution among two (or more) concurrent threads. To prevent deadlock, one context is marked as the native context and the other as the guest context: a core's native context may only hold the thread that started execution there (called the thread's native core), and evicted threads must return to their native cores to ensure deadlock freedom [14]. Briefly, when a core C running thread T executes a memory access for address A, it must 1. find the home core H for A (e.g., by consulting a mapping table or masking the appropriate bits); 2. if H = C (a core hit), (a) forward the request for A to the local cache hierarchy (possibly resulting in a DRAM access); 29 3. if H # C (a core miss), (a) interrupt the execution of the thread on C (as for a precise exception), (b) unload the execution context (microarchitectural state) and convert it to a network packet (as shown in Figure 2-1), and send it to H via the on-chip interconnect: i. if H is the native core for T, place it in the native context slot; ii. otherwise: A. if the guest slot on H contains another thread T', evict T' and migrate it to its native core N' B. move T into the guest slot for H; (c) resume execution of T on H, requesting A from its cache hierarchy (and potentially accessing backing DRAM or the next-level cache). When an exception occurs on a remote core, the thread migrates to its native core to handle it. Although the migration framework requires hardware changes to the baseline directoryless design (since the core must be designed to support efficient migration), it migrates threads directly over the interconnect, which is much faster than other thread migration approaches (such as OS-level migration or Thread Motion [49], which leverage the existing cache coherence protocol to migrate threads). 
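A minimal behavioral sketch of the core-miss handling just described is given below, including the native/guest context rule that keeps migrations deadlock-free. Class and function names are illustrative, the model abstracts away timing and packet formats, and the predictor hook simply stands in for the decision mechanism designed in Chapter 3.

```python
# Behavioral sketch of core-miss handling with thread migration (illustrative).
# Each core has one native context (reserved for the thread that started there) and
# one guest context; an arriving thread evicts any current guest back to that guest's
# native core, which is what guarantees deadlock freedom.
from dataclasses import dataclass

NUM_CORES = 64
home_core = lambda addr: (addr >> 12) % NUM_CORES     # assumed mapping, as before

@dataclass
class Thread:
    native_core: int
    running_on: int

class Core:
    def __init__(self):
        self.native = None    # thread whose native core this is (if currently here)
        self.guest = None     # at most one visiting thread

cores = [Core() for _ in range(NUM_CORES)]

def migrate(thread, dest):
    """Move the thread's execution context to core `dest` over the interconnect."""
    if dest == thread.native_core:
        cores[dest].native = thread                   # native slot is always available
    else:
        if cores[dest].guest is not None:             # evict the current guest...
            evicted = cores[dest].guest
            migrate(evicted, evicted.native_core)     # ...back to its own native core
        cores[dest].guest = thread
    thread.running_on = dest

def on_memory_access(thread, addr, migrate_predicted):
    home = home_core(addr)
    if home == thread.running_on:
        return "local"                                # core hit: access the local cache
    if migrate_predicted:                             # decided per instruction (Chapter 3)
        migrate(thread, home)
        return "migrated"
    return "remote access"                            # word-level round trip instead

t = Thread(native_core=0, running_on=0)
print(on_memory_access(t, 0x5000, migrate_predicted=True), t.running_on)   # migrated 5
```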
2.4 Performance Overhead of Thread Migration

Since the thread context is directly sent across the network, the performance overhead of thread migration is directly affected by the context size. The relevant architectural state that must be migrated in a 64-bit x86 processor amounts to about 3.1 Kbits (sixteen 64-bit general-purpose registers, sixteen 128-bit floating-point registers, and special-purpose registers), which is what we use in this thesis. The context size will vary depending on the architecture; in the TILEPro64 [7], for example, it amounts to about 2.2 Kbits (64 32-bit registers and a few special registers).

[Figure 2-1: Hardware-level thread migration via the on-chip interconnect. A packetizer unloads the thread context from the register file into the outgoing queue, and a depacketizer at the destination tile loads the incoming context back into the core.]

This introduces a serialization latency since the full context needs to be loaded (unloaded) into (from) the network: with a 128-bit flit network and a 3.1 Kbit context size, this becomes ⌈packet size / flit size⌉ = 26 flits, incurring a serialization overhead of 26 cycles. With a 64-bit register file with two read ports and two write ports, one 128-bit flit can be read/written in one cycle, and thus we assume no additional serialization latency due to a lack of ports from/to the thread context. Another overhead is the pipeline insertion latency. Since a memory address is computed at the end of the execute stage, if a thread ends up migrating to another core and re-executing from the beginning of the pipeline, it needs to refill the pipeline. In the case of a typical five-stage pipeline core, this results in an overhead of three cycles. To make fair performance comparisons, all these migration overheads are included as part of execution time for architectures that use thread migrations, and their values are specified in Table 3.1.

2.5 Hybrid Memory Access Framework

We now propose a hybrid architecture by combining the two mechanisms described: each core-miss memory access may either perform the access via a remote access as in Section 2.2 or migrate the current execution thread as in Section 2.3. This architecture is illustrated in Figure 2-2.

[Figure 2-2: Hybrid memory access framework for our directoryless architecture. On each memory access, the originating core checks whether the address is cacheable locally; on a core miss, it either sends a remote access request to the home core (which returns data for a read or an acknowledgement for a write) or migrates the thread to the home core, possibly migrating another thread back to its native core.]

For each access to memory cached on a remote core, a decision algorithm determines whether the access should migrate to the target core or execute a remote access. Because this decision must be taken on every access, it must be implementable as efficient hardware. In our design, an automatic predictor decides between migration and remote access at a per-instruction granularity. It is worthwhile to mention that we allow replication for instructions since they are read-only; threads need not perform a remote access nor migrate to fetch instructions. We describe the design of this predictor in the next chapter.
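Before moving on, the overheads quoted in Section 2.4 can be tied together in one short helper that contrasts a migration against a run of remote accesses. The per-hop and message-size figures below are illustrative placeholders (they are not measured results from this thesis), and the exact flit count depends on which special-purpose registers the context includes.

```python
import math

# Illustrative cost helpers based on the Section 2.4 parameters.
def serialization_flits(context_bits, flit_bits=128):
    return math.ceil(context_bits / flit_bits)

def migration_cycles(context_bits, hops, per_hop_cycles=2,
                     flit_bits=128, pipeline_refill=3):
    # context load/unload (one flit per cycle) + network traversal + pipeline refill
    return (serialization_flits(context_bits, flit_bits)
            + hops * per_hop_cycles + pipeline_refill)

def remote_access_cycles(n_accesses, hops, per_hop_cycles=2, request_response_flits=2):
    # each core-miss access is a round trip of short (here, single-flit) messages
    return n_accesses * (2 * hops * per_hop_cycles + request_response_flits)

full_context = 26 * 128                           # the 26-flit context used above
print(migration_cycles(full_context, hops=8))     # 26 + 16 + 3 = 45 cycles
print(remote_access_cycles(1, hops=8))            # 34 cycles: one access favors remote access
print(remote_access_cycles(3, hops=8))            # 102 cycles: a short run favors migrating
```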
Chapter 3

Thread Migration Prediction

3.1 Introduction

Under the remote-access-only architecture, every core-miss memory access results in a round-trip remote request and its reply (a data word for a load, an acknowledgement for a store). Therefore, migrating a thread can be beneficial when several memory accesses are made to the same core: while the first access incurs the migration cost, the remaining accesses become local and are much faster than remote accesses. Since thread migration costs exceed the per-access cost of remote-access-only designs due to the large thread context size, the goal of the thread migration predictor is to decide judiciously whether or not a thread should migrate: since migration outperforms remote accesses only for multiple contiguous memory accesses to the same location, our migration predictor focuses on detecting those.

3.2 Thread Migration Predictor

3.2.1 Per-core Thread Migration Predictor

Since the migration/remote-access decision must be made on every memory access, the decision mechanism must be implementable as efficient hardware. To this end, we will describe a per-core migration predictor, a PC-indexed direct-mapped data structure where each entry simply stores a PC.

[Figure 3-1: Hybrid memory access architecture with a thread migration predictor on a 5-stage pipeline core. The duplicated architectural context (RegFile2 and PC2) and the migration predictor sit alongside the fetch, decode, execute, memory, and writeback stages; on a core hit the access proceeds to the memory stage, while on a core miss a predictor hit triggers a thread migration and a predictor miss triggers a remote access.]

The predictor is based on the observation that sequences of consecutive memory accesses to the same home core are highly correlated with the program flow, and that these patterns are fairly consistent and repetitive across program execution. Our baseline configuration uses 128 entries; with a 64-bit PC, this amounts to about 1 KB total per core. The migration predictor can be consulted in parallel with the lookup of the home core for the given address. If the home core is not the core where the thread is currently running (a core miss), the predictor must decide between a remote access and a thread migration: if the PC hits in the predictor, it instructs the thread to migrate; if it misses, a remote access is performed. Figure 3-1 shows the integration of the migration predictor in a hybrid memory access architecture on a 5-stage pipeline core. The architectural context (RegFile2 and PC2) is duplicated to support deadlock-free thread migration (cf. Section 2.3); the shaded module is the migration predictor. In the next section, we describe how a certain instruction (or PC) can be detected as "migratory" and thus inserted into the migration predictor.

3.2.2 Detecting Migratory Instructions: WHEN to migrate

At a high level, the prediction mechanism operates as follows:

1. when a program first starts execution, it runs as the baseline directoryless architecture, which only uses remote accesses;
2. as it continues execution, it monitors the home core information for each memory access, and
3. remembers the first instruction of every multiple-access sequence to the same home core;
4. depending on the length of the sequence, the instruction address is either inserted into the migration predictor (a migratory instruction) or evicted from the predictor (a remote-access instruction);
5. the next time a thread executes the instruction, it migrates to the home core if it is a migratory instruction (a "hit" in the predictor), and performs a remote access if it is a remote-access instruction (a "miss" in the predictor).

The detection of migratory instructions which trigger thread migrations can easily be done by tracking how many consecutive accesses to the same remote core have been made, and, if this count exceeds a threshold, inserting the PC into the predictor to trigger migration. If it does not exceed the threshold, the instruction is classified as a remote-access instruction, which is the default state. Each thread tracks (1) Home, which maintains the home location (core ID) for the currently requested memory address, (2) Depth, which indicates how many times so far a thread has contiguously accessed the current home location (i.e., the Home field), and (3) Start PC, which tracks the PC of the very first instruction among the memory sequence that accessed the home location stored in the Home field. We separately define the depth threshold θ, which indicates the depth at which we determine the instruction to be migratory. The detection mechanism is as follows: when a thread T executes a memory instruction for address A whose PC = P, it must

1. find the home core H for A (e.g., by consulting a mapping table or masking the appropriate bits);
2. if Home = H (i.e., a memory access to the same home core as that of the previous memory access):
   (a) if Depth < θ, increment Depth by one;
3. if Home ≠ H (i.e., a new sequence starts with a new home core):
   (a) if Depth = θ, Start PC is considered a migratory instruction and is inserted into the migration predictor;
   (b) if Depth < θ, Start PC is considered a remote-access instruction;¹
   (c) reset the entry (i.e., Home = H, Start PC = P, Depth = 1).

[Figure 3-2: An example of how instructions (or PCs) that are followed by consecutive accesses to the same home location (i.e., migratory instructions) are detected in the case of the depth threshold θ = 2. The figure tabulates, for each memory instruction I1 through I7, its PC and home core, the present and next state of the {Home, Depth, Start PC} entry, and the resulting predictor action.]

Figure 3-2 shows an example of the detection mechanism when θ = 2. Setting θ = 2 means that a thread will perform remote accesses for "one-off" accesses and will migrate for multiple accesses (≥ 2) to the same home core.

¹ Since all instructions are initially considered remote-access instructions, setting an instruction as a remote-access instruction has no effect if it has not been classified as a migratory instruction. If the instruction was migratory (i.e., its PC is in the predictor), however, it reverts to the remote-access mode by invalidating the corresponding entry of the migration predictor.
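The per-thread {Home, Depth, Start PC} tracking state described above can be captured in a few lines; this is a software sketch of the hardware mechanism, where `predictor` is assumed to be any table offering insert() and evict() (such as the predictor sketch given after Section 3.2.3) and `theta` is the depth threshold.

```python
# Software sketch of the per-thread migratory-instruction detector (Section 3.2.2).
class MigratoryDetector:
    def __init__(self, theta, predictor):
        self.theta = theta
        self.predictor = predictor
        self.home = None          # home core of the current access run
        self.depth = 0            # consecutive accesses to self.home (capped at theta)
        self.start_pc = None      # PC of the first access in the current run

    def on_memory_access(self, pc, home):
        if home == self.home:                          # same home core as the last access
            if self.depth < self.theta:
                self.depth += 1
            return
        if self.start_pc is not None:                  # a new run begins: classify the old one
            if self.depth >= self.theta:
                self.predictor.insert(self.start_pc)   # migratory instruction
            else:
                self.predictor.evict(self.start_pc)    # stays (or reverts to) remote access
        self.home, self.depth, self.start_pc = home, 1, pc
```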
Suppose a thread executes a sequence of memory instructions, I1 through I7 (non-memory instructions are ignored in this example because they do not change the entry content nor affect the mechanism). The PC of each instruction from I1 to I7 is PC1, PC2, ..., PC7, respectively, and the home core for the memory address that each instruction accesses is specified next to each PC. When I1 is first executed, the entry {Home, Depth, Start PC} will hold the value {A, 1, PC1}. Then, when I2 is executed, since the home core of I2 (B) is different from Home, which maintains the home core of the previous instruction I1 (A), the entry is reset with the information of I2. Since the Depth to core A has not reached the depth threshold, PC1 is considered a remote-access instruction (the default). The same thing happens for I3, setting PC2 as a remote-access instruction. Now when I4 is executed, it accesses the same home core C and thus only the Depth field needs to be updated (incremented by one). For I5 and I6, which keep accessing the same home core C, we need not update the entry because the depth has already reached the threshold θ, which we assumed to be 2. Lastly, when I7 is executed, since the Depth to core C has reached the threshold, PC3 in the Start PC field, which represents the first instruction (I3) that accessed this home core C, is classified as a migratory instruction and thus is added to the migration predictor. Finally, the predictor resets the entry and starts a new memory sequence from PC7 for the home core A.

When an instruction (or PC) that has been added to the migration predictor is encountered again, the thread will directly migrate instead of sending a remote request and waiting for a reply. Suppose the example sequence I1 ~ I7 we used in Figure 3-2 is repeated as a loop (i.e., I1, I2, ..., I7, I1, ...) by a thread originating at core A. Under a standard, remote-access-only architecture where the thread never leaves its native core A, every loop iteration will incur five round-trip remote accesses; among the seven instructions from I1 to I7, only two of them (I1 and I7) access core A and result in core hits. Under our migration predictor with θ = 2, on the other hand, PC3 and PC7 will be added to the migration predictor, and thus the thread will migrate at I3 and I7 in the steady state. As shown in Figure 3-3, every loop iteration incurs two migrations, turning I4, I5, and I6 into core hits (i.e., local accesses) at core C: overall, 4 out of 7 memory accesses complete locally.

[Figure 3-3: An example of how the decision between remote access and thread migration is made for every memory access. (a) I2 is served via a remote access since its PC, PC2, is not in the migration predictor. (b) The thread migrates when it encounters I3 since it hits in the migration predictor. (c) By migrating the thread to core C, three successive accesses to core C (I4, I5 and I6) now turn into local memory accesses. (d) On I7, the thread migrates back to core A. Overall, two migrations and one remote access are incurred for a single loop.]

The benefit of migrating a thread becomes even more significant with a longer sequence of successive memory accesses to the same non-native core (core C in this example).

3.2.3 Possible Thrashing in the Migration Predictor

Since we use a fixed-size data structure for our migration predictor, collisions between different migratory PCs can result in suboptimal performance.
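Because the predictor is a direct-mapped, PC-indexed table, such collisions arise exactly as in a direct-mapped cache. The minimal sketch below makes the thrashing scenario concrete; the table size matches the baseline configuration, but the index function and example PCs are assumptions for illustration.

```python
# Direct-mapped, PC-indexed migration predictor (illustrative model).
class MigrationPredictor:
    def __init__(self, entries=128):
        self.entries = entries
        self.table = [None] * entries          # each entry stores a full PC

    def _index(self, pc):
        return (pc >> 2) % self.entries        # low-order PC bits (assumed index function)

    def insert(self, pc):                      # mark PC as migratory
        self.table[self._index(pc)] = pc       # may silently displace a colliding PC

    def evict(self, pc):                       # revert PC to remote-access mode
        i = self._index(pc)
        if self.table[i] == pc:
            self.table[i] = None

    def is_migratory(self, pc):                # consulted on every core miss
        return self.table[self._index(pc)] == pc

p = MigrationPredictor()
p.insert(0x40000100)
p.insert(0x40000100 + 128 * 4)        # maps to the same entry: the first PC is displaced
print(p.is_migratory(0x40000100))     # False -> that access falls back to a remote access,
                                      # which is slower but still functionally correct
```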
While we have chosen a size that results in good performance, some designs may need larger (or smaller) predictors. Another subtlety is that mispredictions may occur if memory access patterns for the same PC differ across two threads (one native thread and one guest thread) running on the same core simultaneously, because they share the same per-core predictor and may override each other's decisions. Should this interference become significant, it can be resolved by implementing two predictors instead of one per core: one for the native context and the other for the guest context.

In our set of benchmarks, we rarely observed performance degradation due to these collisions and mispredictions with a fairly small predictor (about 1 KB per core) shared by both the native and the guest context. This is because each worker thread executes very similar instructions (although on different data) and thus the detected migratory instructions for threads are very similar. While such application behavior may keep the predictor simple, our migration predictor is not restricted to any specific applications and can be extended if necessary as described above. It is important to note that even if a rare misprediction occurs due to either a predictor eviction or interference between threads, the memory access will still be carried out correctly, and the functional correctness of the program is still maintained.

3.3 Experimental Setup

We use Pin [5] and Graphite [43] to model the proposed hybrid architecture that supports both remote access and thread migration. Pin enables runtime binary instrumentation of parallel programs; Graphite implements a tile-based multicore, memory subsystem, and network, modeling performance and ensuring functional correctness. The default system parameters are summarized in Table 3.1.

Parameter            | Settings
Cores                | 64 in-order, 5-stage pipeline, single-issue cores, 2-way fine-grain multithreading
L1/L2 cache per core | 32/128 KB, 2/4-way set associative, 64B blocks
Electrical network   | 2D mesh, XY routing, 2 cycles per hop (+ contention), 128b flits
Migration overhead   | 3.1 Kbit full execution context size; full-context load/unload latency = ⌈pkt size / flit size⌉ = 26 cycles; pipeline insertion latency = 3 cycles
Data placement       | First-touch after initialization, 4 KB page size

Table 3.1: System configurations used

Experiments were performed using Graphite's model of an electrical mesh network with XY routing and 128-bit flits. Since modern NoC routers are pipelined [19], and 2- or even 1-cycle per-hop router latencies [38] have been demonstrated, we model a 2-cycle per-hop router delay; we also account for the pipeline latencies associated with loading/unloading packets onto the network. In addition to the fixed per-hop latency, we model contention delays using a probabilistic model as in [37]. For data placement, we use the first-touch after initialization policy, which allocates each page to the core that first accesses it after parallel processing has started. This allows private pages to be mapped locally to the core that uses them, and avoids all the pages being mapped to the same core where the main data structure is initialized before the actual parallel region starts.

3.3.1 Application Benchmarks

Our experiments use a parallel perceptron cross-validation (prcn+cv) benchmark and a set of Splash-2 [60] benchmarks with the recommended input set for the number of cores used²: fft, lu-contiguous, ocean-contiguous, radix³, raytrace and water-nsq.
[Figure 3-4: Parallel K-fold cross-validation using perceptron. The total training data is spread across the L2 cache slices (data chunk i is mapped to core i), and the K experiments execute in parallel; each thread runs a separate experiment, which sequentially trains the model with (K-1) data chunks and tests with the last chunk.]

Parallel cross-validation (prcn+cv) is a popular machine learning technique for optimizing model accuracy. In k-fold cross-validation, as illustrated in Figure 3-4, data samples are split into k disjoint chunks and used to run k independent leave-one-out experiments. Each thread runs a separate experiment, which sequentially trains the model with k - 1 data chunks (training data) and tests with the last chunk (test data). The results of the k experiments are used either to better estimate the final prediction accuracy of the algorithm being trained or, when used with different parameter values, to pick the parameter that results in the best accuracy. Since the experiments are computationally independent, they naturally map to multiple threads. Indeed, for sequential machine learning algorithms, such as stochastic gradient descent, this is the only practical form of parallelization because the model used in each experiment is necessarily sequential. The chunks are typically spread across the shared cache shards, and each experiment repeatedly accesses a given chunk before moving on to the next one.

Our set of Splash-2 benchmarks is slightly modified from the original versions: while both the remote-access-only baseline and our proposed architecture do not allow replication for any kind of data at the hardware level, read-only data can actually be replicated without breaking cache coherence even without directories and a coherence protocol. We, therefore, applied source-level read-only data replication to these benchmarks; more details on this can be found in Appendix A. Our optimizations were limited to rearranging and replicating some data structures (i.e., only tens of lines of code changed) and did not alter the algorithm used; automating this replication is outside the scope of this work. It is important to note that both the remote-access-only baseline and our hybrid architecture benefit almost equally from these optimizations.

² Some were not included due to simulation issues.
³ Unlike other Splash-2 benchmarks, radix originally filled an input array with random numbers (not a primary part of the radix-sort algorithm) in the parallel region; we therefore moved the initialization part prior to spawning worker threads so that the parallel region solely performs the actual sorting.

Each application was run to completion; for each simulation run, we measured the core miss rate, the number of core-miss memory accesses divided by the total number of memory accesses. Since each core-miss memory access must be handled either by remote access or by thread migration, the core miss rate can further be broken down into the remote access rate and the migration rate. For the baseline remote-access-only architecture, the core miss rate equals the remote access rate (i.e., no migrations); for our hybrid design, the core miss rate is the sum of the remote access rate and the migration rate. For performance, we measured the parallel completion time (the longest completion time in the parallel region). Migration overheads (cf. Chapter 2.4) for our hybrid architecture are taken into account.
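To connect the prcn+cv access pattern described above to the core-miss behavior being measured, the sketch below shows the per-thread structure of parallel k-fold cross-validation: each fold sweeps sequentially over the data chunks, so with chunk i homed at core i, every inner loop is a long run of accesses to a single remote core. A trivial stand-in update replaces the actual perceptron training, and all names and data are illustrative.

```python
# Structure of the parallel k-fold cross-validation benchmark (illustrative sketch).
from concurrent.futures import ThreadPoolExecutor

def run_fold(fold, chunks):
    """One experiment: train on every chunk except `fold`, then test on `fold`."""
    weight = 0.0
    for i, chunk in enumerate(chunks):          # sequential sweep over chunks
        if i == fold:
            continue
        for x, y in chunk:                      # long run of accesses to chunk i,
            weight += 0.01 * (y - weight * x) * x   # i.e., to chunk i's home core
    test = chunks[fold]
    return sum(1 for x, y in test if (weight * x >= 0.5) == (y >= 0.5)) / len(test)

k = 4
chunks = [[(x / 10, x / 10) for x in range(10)] for _ in range(k)]   # toy data chunks
with ThreadPoolExecutor(max_workers=k) as pool:                      # one thread per fold
    accuracies = list(pool.map(run_fold, range(k), [chunks] * k))
print(accuracies)
```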
3.3.2 Evaluated Systems

Since our primary focus in this chapter is to improve the capability of exploiting data locality at remote cores by using thread migrations judiciously, we compare our hybrid directoryless architecture against the remote-access-only directoryless architecture.⁴ We refer to the directoryless, remote-access-only architecture as NoDirRA and to the hybrid architecture with our migration predictor as NoDirPred-Full. The suffix -Full means that the entire thread context is always migrated upon a thread migration.

⁴ The performance comparison against a conventional directory-based scheme is provided in Chapter 4.

3.4 Simulation Results

3.4.1 Performance

We first compare the core miss rates for a directoryless system without and with thread migration; the results are shown in Figure 3-5. The depth threshold θ is set to 3 for our migration predictor, which aims to perform remote accesses for memory sequences with one or two accesses and migrations for those with ≥ 3 accesses to the same core. Although we have evaluated our system with different values of θ, we consistently use θ = 3 here since increasing θ only makes our hybrid design converge to the remote-access-only design and does not provide any further insight.

While 21% of total memory accesses result in core misses for the remote-access-only design on average, the directoryless architecture with our migration predictor results in a core miss rate of 6.7%, a 68% improvement in data locality. Figure 3-5 also shows the fraction of core-miss accesses handled by remote accesses and by thread migrations in our design. We observe that a large fraction of remote accesses is successfully replaced with a much smaller number of migrations. For example, prcn+cv shows the best-case scenario: it originally incurred an 87% remote access rate under a remote-access-only architecture, which dropped to 0.8% with a small number of migrations. Across all benchmarks, the average migration rate is only 1%, resulting in 68% fewer core misses overall.

Figure 3-5: Core miss rate (%) and its breakdown into remote access rate and migration rate (remote accesses under NoDirRA, remote accesses under NoDirPred-Full, and migrations under NoDirPred-Full).

This improvement in data locality translates into better performance for our directoryless architecture with thread migration, as shown in Figure 3-6. For our set of benchmarks, our proposed system shows 25% better performance on average (geometric mean) across all benchmarks; excluding prcn+cv, the average performance improvement is 6%. However, due to the relatively large thread context size, the network traffic overhead of thread migrations can be significant. Figure 3-7 shows on-chip network traffic, measured as the number of flits sent times the number of hops traveled. Except for prcn+cv, we observe that NoDirPred-Full actually incurs more network traffic than NoDirRA (even with small migration rates). Therefore, in order to make the architecture more viable, we believe that reducing migration costs is critical; this is addressed in the next chapter.

Figure 3-6: Parallel completion time normalized to the remote-access-only architecture (NoDirRA).

3.5 Chapter Summary

In this chapter, we presented an on-line, PC-based thread migration predictor for our directoryless architecture that uses thread migration and remote access.
Our results show that migrating threads for sequences of multiple accesses to the same core can improve data locality in directoryless designs, and that, with our predictor, this can result in better performance compared to the baseline design that relies only on remote accesses. However, we observed that the high network traffic overhead of thread migration remains, since the entire thread context is always migrated. We therefore need to further reduce migration costs, which is achieved by the partial context thread migration described in the next chapter.

Figure 3-7: Network traffic normalized to the remote-access-only architecture (NoDirRA).

Chapter 4

Partial Context Migration for General Register File Architecture

4.1 Introduction

We can further reduce the cost of thread migrations by sending only a part of the register file when a thread migrates. This is based on the observation that usually only some of the registers are used between the time the thread migrates out of its native core and the time it migrates back; therefore, if this subset of registers can be accurately predicted when the thread migrates, migration costs can be cut down significantly. In this chapter, we therefore present partial context migration; its goal is to predict which registers will be read or written while the thread is away from its native core, and to migrate only those. Implementing such partial context migration requires the core architecture to support 1) partial loading and unloading of the thread context, 2) the capability to predict which part of the context will be used at the migrated-to core, and 3) a mechanism to handle mispredictions. These are discussed in detail below.

Figure 4-1: Hardware-level thread migration with partial context migration support. Each tile adds a register-mask packetizer (context unload) and depacketizer (context load) between the core's register file and the incoming/outgoing network queues.

4.2 Partial Context Thread Migration

4.2.1 Extending Migration Predictor

Figure 4-1 shows a hardware architecture to support partial context migration: during a thread migration, a packetizer (or a depacketizer) decodes a register mask, a bit-vector in which each bit represents whether or not the corresponding register is to be migrated; registers whose corresponding bits are set in the register mask are unloaded onto the network (or loaded from the network). With the deadlock-free thread migration framework described in [14], even though a thread migrates away from its native core, the native-core register file remains intact in its native context since it is not used by any other guest threads; this allows us to carry out only the registers read "on the trip" and bring back only the registers written while away.
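As a rough software analogue of the packetizer just described, the sketch below walks a 32-bit register mask and serializes only the selected registers. The packet layout and data types are simplifications for illustration, not the hardware flit format, and the real packetizer also carries special-purpose registers.

    // Sketch of register-mask-driven context packing and unpacking (illustrative).
    #include <cstdint>
    #include <vector>

    struct MigrationPacket {
        uint64_t pc;
        uint32_t reg_mask;               // bit i set => register i is included
        std::vector<uint64_t> payload;   // selected register values, in index order
    };

    MigrationPacket pack_context(uint64_t pc, uint32_t reg_mask,
                                 const uint64_t regfile[32]) {
        MigrationPacket pkt{pc, reg_mask, {}};
        for (int i = 0; i < 32; ++i)
            if (reg_mask & (1u << i))
                pkt.payload.push_back(regfile[i]);   // unload only masked registers
        return pkt;
    }

    void unpack_context(const MigrationPacket& pkt, uint64_t regfile[32]) {
        size_t k = 0;
        for (int i = 0; i < 32; ++i)
            if (pkt.reg_mask & (1u << i))
                regfile[i] = pkt.payload[k++];       // load in the same order
    }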
We now extend our migration predictor based on the observation that not only sequences of consecutive memory accesses to the same home core, but also the register usage patterns within those sequences, are highly correlated with the program (instruction) flow. Our baseline configuration uses a 128-entry predictor, each entry of which consists of a 64-bit PC and a 32-bit register mask, which amounts to about 1.5KB in total.¹ Our extended migration predictor is shown in Figure 4-2.

¹ An N-bit mask is required for an architecture with N general register file registers, where each bit indicates whether or not the corresponding register needs to be sent in case of a migration. In this thesis, we use N = 32, which accounts for 16 64-bit registers (rdi, rsi, rbp, rsp, rbx, rdx, rcx, rax and r8 to r15) and 16 128-bit XMM registers (xmm0 to xmm15).

Figure 4-2: A per-core PC-based migration predictor, where each entry contains a {PC, register mask} pair. A lookup hit means "migrate", and the registers whose mask bits are set are sent (the mask is used only when a thread is migrating from its native core).

Our original predictor decides between a remote access and a thread migration upon a core miss. With partial context migration support, moreover, if the thread is migrating from its native core to another core, only the registers whose corresponding bits in the register mask are set are transferred.² This register mask field is used only when a thread leaves its native core, and not when it migrates from one non-native core to another non-native core, or when it migrates back to its native core from outside. In the next section, we describe how the predictor stores the used-register information for each migratory instruction.

² Special-purpose registers such as rip, rflags and mxcsr are always transferred and thus are not included in the 32-bit register mask.

4.2.2 Detection of Useful Registers: WHAT to Migrate

We now extend our migration predictor to support partial context migrations by predicting which registers need to be sent for each migration and sending only those. This requires each thread to keep track of which registers have been read or written within a sequence of memory instructions accessing the same home core; this can easily be implemented on top of the mechanism we described in Chapter 3.2.2. In addition to (1) Home, (2) Depth, and (3) Start PC, each thread now also tracks (4) Used Registers, a 32-bit vector in which each bit indicates whether the corresponding register has been used. Every instruction (both memory and non-memory) updates this Used Registers field by setting the bit when the corresponding register is read or written. It may seem that registers which are only written and never read while a thread is away from its native core need not be transferred because they will be overwritten anyway. This is true when the ISA does not support partial registers. In our design, however, we assume registers can be partially read or written, and we therefore treat these registers as a necessary part of the migration context to simplify handling of writes into a partial register.
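A minimal sketch of this per-thread bookkeeping is shown below; the fields mirror the (1)-(4) state described above, while the instruction descriptor and register-mask encoding are our own simplification of the hardware.

    // Sketch of per-thread tracking of Home, Depth, Start PC and Used Registers.
    #include <cstdint>

    struct SequenceTracker {
        int      home      = -1;   // home core of the current access sequence
        int      depth     = 0;    // consecutive accesses to that home core
        uint64_t start_pc  = 0;    // PC of the first access in the sequence
        uint32_t used_regs = 0;    // bit i set => register i was read or written
    };

    // 'src_dst_mask' has a bit set for each register the instruction reads or writes;
    // for memory instructions, 'home_core' is the home of the accessed address.
    void on_instruction(SequenceTracker& t, uint64_t pc, uint32_t src_dst_mask,
                        bool is_memory, int home_core) {
        if (is_memory) {
            if (home_core != t.home) {   // new sequence: reset the tracker
                t.home = home_core;
                t.depth = 0;
                t.start_pc = pc;
                t.used_regs = 0;
            }
            t.depth++;
        }
        t.used_regs |= src_dst_mask;     // every instruction updates Used Registers
    }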
Chapter 3.2.2), the Used Registers field is inserted together with the Start PC into the Useful Register Mask field of the migration predictor (see Figure 4-2).

Figure 4-3 shows an example of the detection mechanism when θ = 2. Suppose a thread executes a sequence of instructions I1 to I5, where I1, I3 and I5 are memory instructions, I2 and I4 are non-memory instructions, and rn denotes the nth register. When I1 is first executed, the entry {Home, Depth, Start PC, Used Registers} will hold the value {C, 1, PC1, r1}. Then, when I2, a non-memory instruction using r2 and r3, is executed, the Used Registers bit-vector is updated to set the bits for r2 and r3. When I3 is executed, it accesses the same home core C and thus the Depth field is incremented by one; r2 is already included in the used register bit-vector, so its value does not change. I4 simply adds r4 to the register bit-vector, and lastly, when I5 is executed, since the Depth for core C has reached the threshold, PC1 in the Start PC field is added to the migration predictor together with the register mask bits. The migration predictor will now contain a {PC, Useful Register Mask} pair, which allows a thread to predict the useful registers from the time it migrates out of its native core until it migrates back.

Figure 4-3: An example of how registers being read/written are tracked and how this information is inserted into the migration predictor when a specific instruction (or PC) is detected as a migratory instruction (depth threshold θ = 2).

4.2.3 Partial Context Migration Policy

The partial context migration policy is as follows (each case is illustrated in Figure 4-4, and a software sketch of the decision logic is given after the list). When a thread T executes a memory instruction whose PC hits in the migration predictor and thus needs to migrate,

1. if T is migrating from its native core to a non-native core, it takes the registers specified in the Useful Register Mask of the migration predictor (cf. Figure 4-4a);

2. if T is migrating from a non-native core to another non-native core, it takes all the registers that T brought along when it first migrated out of its native core (cf. Figure 4-4b);

3. if T is migrating back to its native core from a non-native core, it takes only the registers that were written while T was away from its native core (cf. Figure 4-4c);

4. special-purpose registers required for thread execution (e.g., rip, rflags and mxcsr for a 64-bit x86 architecture) are always transferred.
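The sketch below captures the three mask-driven cases in software form; the mask arguments anticipate the V-mask and W-mask defined immediately below, and the function signature is an illustration rather than the hardware interface.

    // Sketch of the partial context migration policy (cases 1-3 above).
    // 'useful_mask' is the predictor's Useful Register Mask for the migrating PC;
    // v_mask / w_mask correspond to the V-mask and W-mask described below.
    #include <cstdint>

    uint32_t select_registers_to_send(bool at_native_core,
                                      bool returning_to_native,
                                      uint32_t useful_mask,  // from the predictor
                                      uint32_t v_mask,       // registers brought along
                                      uint32_t w_mask) {     // registers written while away
        if (at_native_core)        // case 1: native -> non-native
            return useful_mask;    //   take only the predicted useful registers
        if (returning_to_native)   // case 3: non-native -> native
            return w_mask;         //   the native register file is intact; send only writes
        return v_mask;             // case 2: non-native -> non-native
    }
    // Case 4 (special-purpose registers such as rip/rflags/mxcsr) is handled
    // outside this mask and is always included in the migration packet.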
In order to implement these policies, a thread carries around two 32-bit masks: a V-mask and a W-mask.

Figure 4-4: An example of partial context thread migration. (a) A thread whose native core is A migrates on a predictor hit, taking only the registers specified in the useful register mask field of the predictor (here r1, r2 and r3). (b) Since only r1, r2 and r3 were brought from the native core, the V-mask contains only these three registers; for a migration between two non-native cores (C to D), only the registers in the V-mask are migrated. (c) If register r1 has been written while the thread is away, the W-mask contains r1, and when the thread migrates back to its native core it brings back only the registers in the W-mask. (d) While running at a non-native core, a register miss occurs if the thread needs a register that is not in the V-mask; the thread then migrates back to its native core with only the written registers.

The V-mask identifies the registers that the thread may access while outside of its native core (looked up in the predictor when the thread first migrates out of its native core). The W-mask keeps track of the registers that have been written while outside the native core, and is used to implement policy (3). Since the register file remains intact in the native context, a thread returning to its native core needs to carry back only the registers that have been modified. During migrations, these two masks (64 bits in total) and {Home, Depth, Start PC, Used Registers} must be transferred together with the 3.1Kbit context (cf. Chapter 2.4). With 64 cores (6 bits for the home core ID), a maximum depth threshold of 8 (3 bits), a 64-bit Start PC and a 32-bit used register mask, a total of 169 bits must be transferred in addition to the context.

It is important to note that, unlike the decision on whether to perform a remote access or a thread migration, the useful register information in the migration predictor is consulted by a thread only at its native core; this is because the native context is the only place where all the register values are maintained for the thread, and once it leaves the native core, the thread cannot use any registers other than the ones it initially brought from its native core (i.e., the registers in the V-mask).

4.2.4 Misprediction Handling

This makes it possible for a thread, while outside its native core, to encounter an instruction which requires a specific register rn that has not been brought from its native core (i.e., rn ∉ V-mask); we call this a register miss. A register miss can happen, for example, when the program flow changes due to branches and conditional jumps, resulting in a different sequence of instructions being executed. When a register miss occurs, the thread stops its execution (just as when a core miss occurs) and returns to its native core (cf. Figure 4-4d). Our migration predictor tries to minimize migrations caused by register misses; therefore, when the thread migrates back, we update the useful register mask in the migration predictor by adding the register that caused the register miss. With this learning mechanism, the useful register mask for a particular PC, PC1, will eventually converge to a superset of the registers that are used after the thread migrates at PC1 until it migrates back to its native core. We show the overhead of register misses and how much network traffic can be reduced using partial context migrations in Chapter 4.4.
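A compact sketch of the register-miss path and the associated predictor update is shown below; the helper names are stand-ins for however the predictor entry is actually written, not a real interface.

    // Sketch of register-miss handling and predictor learning (illustrative only).
    #include <cstdint>

    struct PredictorEntry { uint64_t pc; uint32_t useful_mask; };

    // Hypothetical helper: merge new bits into the entry for the start PC.
    void predictor_update_mask(PredictorEntry& e, uint32_t extra_bits) {
        e.useful_mask |= extra_bits;    // the mask only grows (converges to a superset)
    }

    // Executed at a non-native core when an instruction needs register 'reg'.
    // Returns true if the thread must stop and migrate back to its native core.
    bool register_miss(uint32_t v_mask, int reg, uint32_t& missed_reg_bit) {
        uint32_t bit = 1u << reg;
        if (v_mask & bit) return false;  // register was brought along: proceed
        missed_reg_bit = bit;            // register miss: remember which register
        return true;                     // return home carrying only the W-mask registers
    }

    // On arrival back at the native core after a register miss.
    void on_return_after_register_miss(PredictorEntry& start_pc_entry,
                                       uint32_t missed_reg_bit) {
        predictor_update_mask(start_pc_entry, missed_reg_bit);
        // Next time the thread migrates at this PC, the missed register is included.
    }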
4.3 Experimental Setup

We use Graphite [43] to model the proposed directoryless architecture that supports both remote access and partial context thread migration. The same system parameters as in the previous chapter are used (cf. Table 3.1).

4.3.1 Evaluated Systems

We compare our hybrid directoryless architecture with the migration predictor (NoDirPred) against the remote-access-only directoryless baseline (NoDirRA). To see how well the predictor itself works, we also compare against a simple DISTANCE decision scheme (NoDirDist) previously proposed in [41]: the intuition here is that over short distances the round-trip remote-access overhead is low, so threads migrate only if the distance to the home core exceeds some threshold d. We use d = 6, the average hop count for an 8x8 mesh, and transfer the full context during migrations. We also present results for a directory-based cache-coherence architecture (DirCC) to provide a sense of how directoryless designs perform compared to conventional designs. DirCC uses the MSI protocol with distributed full-map directories on a private-L1, shared-L2 configuration. This makes for an apples-to-apples comparison between directory schemes and directoryless designs because using the shared-L2 configuration with the same data placement policy results in negligible differences in off-chip access rates across all the systems we evaluate; the main performance gap stems from the performance of on-chip cache accesses.

4.4 Simulation Results

4.4.1 Performance and Network Traffic

We compare the overall performance of DirCC, NoDirRA, NoDirDist, and NoDirPred; the results are shown in Figure 4-5. For NoDirPred, the depth threshold θ is set to 3. Although we have evaluated our system with different values of θ, we consistently use θ = 3 here since increasing θ only makes our hybrid design converge to the remote-access-only design and does not provide any further insight.

When compared to DirCC, NoDirRA performs worse by 59% on average, while our hybrid architecture (NoDirPred) performs worse by 18% on average. NoDirDist performs the worst, indicating that migration decisions must be made judiciously. Since I-cache content is not transferred during migrations, NoDirPred shows 7% more I-cache misses than NoDirRA on average; I-cache miss rates, however, are still very low (mostly < 0.1%) and have a negligible effect on performance.

Figure 4-5: Parallel completion time normalized to DirCC.

We also compare on-chip network traffic in each system, measured as the number of flits sent times the number of hops traveled. Figure 4-6 shows that NoDirPred reduces network traffic by 24% on average compared to DirCC, and by 55% compared to NoDirRA; while not shown in the figure, the network traffic for NoDirDist is prohibitive, 6x more on average than DirCC. Although the average performance of NoDirPred is lower than that of DirCC, it is important to note that most of the benchmarks we use were originally developed with directory coherence in mind. Parallel cross-validation with the perceptron learning algorithm (prcn+cv) is an example where directory-based coherence does not work well; the computation requires each thread to traverse a dataset spread across the cores, resulting in many accesses to remote caches and high network overhead for DirCC. As a result, NoDirPred outperforms DirCC by 34% with 42x less traffic for prcn+cv, demonstrating that such overhead can be eliminated by migrating threads to the data.

Figure 4-6: Network traffic normalized to DirCC.

To better understand the overall performance, we measured L1 cache miss rates for DirCC and NoDirPred; the results are shown in Figure 4-7. Since cache lines are not replicated across L1 caches in the directoryless design (NoDirPred), the effective L1 cache capacity increases, always resulting in lower L1 miss rates than DirCC; more importantly, while all L1 misses under NoDirPred are forwarded to local L2 caches, a large fraction of L1 misses for DirCC result in memory requests to remote L2 caches, a major factor in performance degradation and network traffic for the directory-based architecture. On the other hand, directoryless designs can suffer when the core miss rate is high, i.e., when frequently accessing data cached in remote cores; the core miss rate of DirCC is always zero.
Figure 4-7: Breakdown of L1 miss rate (L1 misses, all local under NoDirPred; L1 misses to local L2 and to remote L2 under DirCC).

Figure 4-8 shows that, on average, 21% of total memory accesses result in core misses for NoDirRA, which drops to only 6.6% for NoDirPred. While not shown, this improvement is achieved with an average migration rate of 1%, indicating that the predictor works well. Raytrace and water are examples where NoDirPred suffers in terms of both performance and network traffic due to high core miss rates.

Figure 4-8: Core miss rate for the directoryless systems (NoDirRA, NoDirDist, NoDirPred).

In order to track how traffic is reduced by partial context migration, we compare our design with the full-context migration variant, which always sends the full thread context during migrations (NoDirPred-Full). The results are shown in Figure 4-9: NoDirPred reduces out-migration traffic (migrations to non-native cores) by 52% and back-migration traffic (migrations back to native cores) by 68% compared to NoDirPred-Full. The reduction in out-migration traffic is achieved by our predictor (the useful register mask field), and the reduction in back-migration traffic is achieved by the W-mask, which keeps track of the written registers. While using partial context migration occasionally induces unnecessary migrations due to register misses, we observe almost no overhead from this because our predictor learns from each miss by adding the missing register to the useful register mask for the appropriate PC. With this union mechanism, however, the register mask only grows and never shrinks back; this makes our context prediction conservative, and thus some of the migrated registers may not actually be used. Across all benchmarks, around 75% of migrated registers are actually used on average (see Figure 4-10), showing that our predictor is reasonably efficient.

4.4.2 The Effects of Network Parameters

We further demonstrate that the relative performance and network traffic of our hybrid architecture (NoDirPred) are maintained over different network parameters. Figure 4-11 shows that NoDirPred outperforms NoDirRA by 29% with 3-cycle per-hop latency (originally, 25% with 2-cycle per-hop latency); this is because the round-trip nature of remote accesses suffers more from increased per-hop latency. With a 64-bit flit network instead of 128-bit, on the other hand, the network traffic reduction of NoDirPred over NoDirRA decreases from 55% to 43%; this is because a large fraction of remote-access messages (i.e., those that do not carry a data word) fit into 64 bits and do not need additional flits to make up for the halved bandwidth. Performance improvements also drop slightly, but not significantly.

4.5 Chapter Summary

In this chapter, we have extended our PC-based migration predictor to support partial context migration in order to reduce the size of the migrated context. Our evaluation results show that, with significantly reduced migration costs, the migration predictor exploits data locality to maximum advantage: it performs better than the remote-access-only baseline by 25% on average, while incurring 55% less network traffic thanks to partial context migrations.
We have further demonstrated that, for certain applications, a directoryless architecture with fine-grained partial-context thread migration can outperform or match directory-based coherence with less on-chip traffic. While the performance of our architecture is 18% worse than the directory-based cache-coherent architecture on average, the network traffic is reduced by 24%; given that the architecture requires no directories or complicated coherence protocols, we believe that our approach points to promising avenues for simplified hardware shared-memory support on many-core CMPs.

Figure 4-9: Network traffic breakdown (register-miss migration, out migration, back migration, and remote access) for NoDirRA, NoDirPred-Full and NoDirPred.

Figure 4-10: Breakdown of migrated context into used and unused registers.

Figure 4-11: The effect of network latency (3-cycle per-hop) and bandwidth (64-bit flits) on parallel completion time and network traffic for DirCC, NoDirRA and NoDirPred.

Chapter 5

The EM2 Silicon Implementation

5.1 Introduction

In previous chapters, we have presented a hardware mechanism for fine-grained thread migration and used the technique to complement remote access in a directoryless architecture with our migration predictor; the predictor not only decides whether to migrate a thread or perform a remote access, but also supports partial context migration. To confirm that such an architecture is indeed realizable in actual hardware, we implemented and fabricated a proof-of-concept chip that demonstrates the feasibility of our approach, namely the Execution Migration Machine (EM2). The implementation process also allowed us to explore the microarchitecture of our proposed schemes in detail. This chapter discusses the design decisions and implementation details of the EM2 chip, a 110-core shared-memory processor that supports thread migration and remote access. Evaluation results using RTL-level simulation of several benchmarks are also provided.

Figure 5-1: Chip-level layout of the 110-core EM2 chip, with off-chip memory interfaces on two sides of the 10 mm x 10 mm die.

5.2 EM2 Processor

5.2.1 System architecture

The physical chip comprises approximately 357,000,000 transistors on a 10 mm x 10 mm die in 45nm ASIC technology, using a 476-pin wirebond package. The EM2 chip consists of 110 homogeneous tiles placed on a 10 x 11 grid. In lieu of a DRAM interface, our test chip exposes the two networks that carry off-chip memory traffic via a programmable rate-matching interface; this, in turn, connects to a maximum of 16GB of DRAM via a controller implemented in an FPGA. The EM2 chip layout is shown in Figure 5-1.

Tiles are connected in a 2D mesh geometry by six independent on-chip networks: two networks carry migration/eviction traffic, another two carry remote-access requests/responses, and a further two carry external DRAM requests/responses; in each case, two networks are required to ensure deadlock-free operation [14]. The six channels are implemented as six physically separate on-chip networks, each with its own router in every tile and six 64-bit links between neighboring tiles (cf. Figure 5-2). Each network carries 64-bit flits using wormhole flow control and dimension-order routing.

Figure 5-2: EM2 tile architecture (855 µm x 917 µm tile with six 64-bit links).
The routers are ingress-buffered and are capable of single-cycle forwarding under congestion-free conditions, a technique feasible even in multi-GHz designs [38]. While using a single network with six virtual channels would have utilized the available link bandwidth more efficiently and made inter-tile routing simpler, it would have substantially increased the crossbar size and significantly complicated the allocation logic (the number of inputs grows proportionally to the number of virtual channels and the number of outputs to the total bisection bandwidth between adjacent routers). Moreover, using six identical networks allowed us to verify the operation of a single network in isolation and then safely replicate it six times to form the interconnect, significantly reducing the total verification effort.

5.2.2 Tile architecture

Figure 5-2 shows an EM2 tile; each tile contains six Network-on-Chip (NoC) routers as described in Chapter 5.2.1, a processor core, a migration predictor, and a single level (L1) of instruction and data caches: an 8KB read-only instruction cache and a 32KB data cache per tile, resulting in a total of 4.4MB of on-chip cache capacity. The caches are capable of single-cycle read hits and two-cycle write hits. The entire memory address space of 16GB is divided into 110 non-overlapping regions as required by the EM2 shared-memory semantics, and each tile's data cache may only cache the address range assigned to it. In addition to serving local and remote requests for the address range assigned to it, the data cache block also provides an interface to remote caches via the remote-access protocol. Memory is word-addressable and there is no virtual address translation; cache lines are 32 bytes. The details of the EM2 processor core architecture are described in the next section.

5.2.3 Stack-based core architecture

Figure 5-3: The stack-based processor core diagram of EM2: each core provides a native context and a guest context, each with its own PC, main stack and auxiliary stack, sharing the instruction and data caches.

To simplify the implementation of partial context migration and maximally reduce on-chip bit movement, EM2 cores implement a custom 32-bit stack-based architecture (cf. Figure 5-3). Since, by the nature of a stack-based ISA, the likelihood of a context entry being needed increases toward the top of the stack, a migrating thread can take along only as much of its context as is required by migrating only the top part of the stack. Furthermore, the amount of context to transfer can easily be controlled with a single parameter: the depth of the stack to migrate (i.e., the number of stack entries from the top of the stack).

To ensure deadlock-free thread migration in all cases, the core contains two thread contexts, called a native context and a guest context (both contexts share the same I$ port, which means that they do not execute concurrently). Each thread has a unique native context where no other thread can execute; when a thread wishes to execute on another core, it must execute in that core's guest context [14]. Functionally, the two contexts are nearly identical; the differences are the data cache interface in the native context that supports stack spills and refills (in a guest context, stacks are not backed by memory, and stack underflow/overflow causes the thread to migrate back to its native context where the stacks can be spilled or refilled), and the thread eviction logic and associated link to the on-chip eviction network in the guest context.
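The depth-controlled context transfer described above can be illustrated with a few lines of software; the packet layout and types below are simplifications for illustration, not the chip's flit format.

    // Sketch of stack-depth-controlled partial context migration (illustrative).
    #include <cstdint>
    #include <vector>

    struct StackContext {
        uint32_t pc;
        std::vector<uint32_t> main_stack;   // index 0 = bottom, back() = top of stack
    };

    // Take only the top 'depth' entries of the main stack along with the PC.
    StackContext pack_migration_context(const StackContext& full, size_t depth) {
        StackContext partial;
        partial.pc = full.pc;
        size_t n = full.main_stack.size();
        size_t take = depth < n ? depth : n;   // cannot take more than is valid
        partial.main_stack.assign(full.main_stack.end() - take, full.main_stack.end());
        return partial;
    }
    // Entries left behind remain in the native context's stack (which is backed by
    // memory), so a returning thread finds them intact or spilled/refilled as needed.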
To reduce CPU area, the EM2 core contains neither a floating-point unit nor an integer divider circuit. The core is a two-stage pipeline with a top-of-stack bypass that allows an instruction's arguments to be sourced from the previous instruction's ALU outputs. Each context has two stacks, main and auxiliary: most instructions take their arguments from the top entries of the main stack and leave their result on the top of the main stack, while the auxiliary stack can only be used to copy or move data from/to the top of the main stack; special instructions rearrange the top four elements of the main stack. The main stack and the auxiliary stack hold 16 and 8 entries, respectively. On stack overflow or underflow, the core automatically spills or refills the stack from the data cache; in a sense, the main and auxiliary stacks serve as caches for conceptually infinite stacks stored in memory.

5.2.4 Thread migration implementation

Whenever a thread migrates out of its native core, it has the option of transmitting only the part of its thread context that it expects to use at the destination core.

Figure 5-4: Hardware-level thread migration via the on-chip interconnect under EM2; only the main stack is shown for simplicity. The head flit starts the migration, the body flits carry the stack entries, and the packet travels H hops before being loaded into the destination context.

In each migration packet, the first (head) flit encodes the destination and the packet length, as well as the thread's ID, the program counter, and the number of main-stack and auxiliary-stack elements carried in the body flits that follow. The smallest useful migration packet consists of one head flit and one body flit containing two 32-bit stack entries. Migrations from a guest context must transmit all of the occupied stack entries, since guest-context stacks are not backed by memory.

Figure 5-4 illustrates how the processor cores and the on-chip network efficiently support fast instruction-granularity thread migration. When the core fetches an instruction that triggers a migration (for example, because of a memory access to data cached in a remote tile), the migration destination is computed and, if there is no network congestion, the migration packet's head flit is serialized into the on-chip router buffers in the same clock cycle. While the head flit transits the on-chip network, the remaining flits are serialized into the router buffers in a pipelined fashion. Once the packet has arrived at the destination NoC router and the destination core context is free, it is directly deserialized; the next instruction is fetched as soon as the program counter is available, and the instruction cache access proceeds in parallel with the deserialization of the migrated stack entries. In our implementation, assuming a thread migrates H hops with B body flits, the overall thread migration latency amounts to 1 + H + 1 + B cycles from the time the migrating instruction is fetched at the source core to the time the thread begins execution at the destination core.
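To make the latency expression concrete, the small example below evaluates 1 + H + 1 + B at two points; the (H, B) values chosen correspond to the extremes discussed in the next paragraph.

    // Evaluating the migration latency model: 1 + H + 1 + B cycles
    // (H hops of travel plus B body flits and two fixed cycles; see text).
    #include <cstdio>

    int migration_latency(int hops, int body_flits) {
        return 1 + hops + 1 + body_flits;
    }

    int main() {
        printf("nearest neighbor, minimal context : %d cycles\n",
               migration_latency(1, 1));     // 4 cycles
        printf("farthest tile, full context       : %d cycles\n",
               migration_latency(19, 12));   // 33 cycles
        return 0;
    }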
In the EM2 chip, H varies from 1 (nearest-neighbor core) to 19 (the maximum number of hops in a 10x11 mesh), and B varies from 1 (two main stack entries and no auxiliary stack entries) to 12 (sixteen main stack entries and eight auxiliary stack entries, at two entries per flit); this results in very low migration latency, ranging from a minimum of 4 cycles to a maximum of 33 cycles (assuming no network congestion).¹

¹ Although it is possible to migrate with no main stack entries, this is unusual, because most instructions require one or two words on the stack to perform computations. The minimum latency in this case is still 4 cycles, because execution must wait for the I$ fetch to complete anyway.

While a native context is reserved for its native thread and is therefore always free when this thread arrives, a guest context might be executing another thread when a migration packet arrives. In this case, the newly arrived thread is buffered until the currently executing thread has had a chance to complete some (configurable) number of instructions; then, the active guest thread is evicted to make room for the newly arrived one. During the eviction process, the entire active context is serialized just as in the case of a migration (the eviction network is used to avoid deadlock), and once the last flit of the eviction packet has entered the network, the newly arrived thread is unloaded from the network and begins execution.

5.2.5 The instruction set

We briefly describe the instruction set architecture (ISA) of the EM2 core below.

Stacks. Each core context contains a main stack (16 entries) and an auxiliary stack (8 entries), and instructions operate on the tops of those stacks much like RISC instructions operate on registers. On stack overflow or underflow, the core automatically accesses the data cache to spill or refill the core stacks. Stacks naturally and elegantly support partial context migration, since the topmost entries, which are migrated as the partial context, are exactly the ones that the next few instructions will use.

Computation and stack manipulation. The core implements the usual arithmetic, logical, and comparison instructions on 32-bit integers, with the exception of hardware divide. These instructions consume one or two elements from the main stack and push their results back there. Instructions in the push class place immediates on the stack, and variants that place the thread ID, core ID, or the PC on top of the stack help effect inter-thread synchronization. To make stack management easier, the top four entries of the main stack can be rearranged using a set of stack manipulation instructions. Access to deeper stack entries can be achieved via instructions that move or copy the top of the main stack onto the auxiliary stack and back.

Control flow and explicit migration. Flow control is effected via the usual conditional branches (which are relative) and unconditional jumps and calls (relative or absolute). Threads can be manually migrated using the migrate instruction, and efficiently spawned on remote cores via the newthread instruction.

Memory instructions. Word-granularity loads and stores come in EM (migrating) and RA (remote-access) versions, as well as in a generic version which defers the decision to the migration predictor. The EM and generic versions encode the stack depths that should be migrated, which can be used instead of the predictor-determined depths. Providing manual and automatic versions gives the user both convenience and maximum control.
Similarly, stores come in acked as well as fire-and-forget variants. Together with per-instruction memory fences, the acked variant provides sequential consistency, while the fire-and-forget version may be used if a higher-level protocol obviates the need for per-word guarantees. Load-reserve and store-conditional instructions provide atomic read-modify-write access, and come in EM and RA flavors.

Figure 5-5: The two-stage scan chain used to configure the EM2 chip: each link consists of a "lockup" register driven by the first scan clock and a "config" register driven by the second.

5.2.6 System configuration and bootstrap

To initialize the EM2 chip to a known state during power-up, we chose to use a scan-chain mechanism. Unlike the commonly employed bootloader strategy, in which one of the cores is hard-coded with the location of a program that configures the rest of the system, successful configuration via the scan-chain approach does not rely on any cores operating correctly: the only points that must be verified are (a) that bits correctly advance through the scan chain, and (b) that the contents of the scan chain are correctly picked up by the relevant core configuration settings. In fact, other than a small state machine that ensures caches are invalidated at reset, the EM2 chip does not have any reset-specific logic that would have to be separately verified. The main disadvantages are (a) that the EM2 chip is not self-initializing, i.e., system configuration must be managed externally to the chip, and (b) that configuration at the slow rate permitted by the scan chain takes a number of minutes. For an academic chip destined to be used exclusively in a lab environment, however, these disadvantages are relatively minor and worth the benefit of offloading complexity from the chip itself onto the test equipment.

The scan chain itself was designed specifically to avoid hold-time violations in the physical design phase. To this end, the chain uses two sets of registers and is driven by two clocks: the first clock copies the current value of the scan input (i.e., the previous link in the chain) into a "lockup" register, while the second moves the lockup register value to a "config" register, which can be read by the core logic (see Figure 5-5). By suitably interleaving the two scan clocks, we ensure that the source of any signal is the output of a flip-flop that is not being written at the same clock edge, thus avoiding hold-time issues. While this approach sacrificed some area (since the scan registers are duplicated), it removed a significant source of hold-time violations during the full-chip assembly phase of physical layout, likely saving us time and frustration.

5.2.7 Virtual memory and OS implications

Although our test chip follows the accelerator model and supports neither virtual memory nor a full operating system, fine-grained migration can equally well be implemented in a full-fledged CPU architecture. Virtual addressing at first sight potentially delays the local-vs-remote decision by one cycle (since the physical address must be resolved via a TLB lookup), but in a distributed shared cache architecture this lookup is already required to resolve which tile caches the data (if the L1 cache is virtually addressed, this lookup can proceed in parallel with the L1 access as usual).
Program-initiated OS system calls and device accesses occasionally require that the thread remain pinned to a core for some number of instructions; this can be accomplished by migrating the thread to its native context on the relevant instruction.² OS-initiated tasks such as process scheduling and load rebalancing typically take place at a granularity of many milliseconds, and can be supported by requiring each thread to return to its native core every so often.

² In fact, our ASIC implementation uses this approach to allow the program to access various statistics tables.

5.3 Migration Predictor for EM2

5.3.1 Stack-based Architecture Variant

As shown in previous chapters, EM2 can improve performance and reduce on-chip traffic by turning sequences of memory accesses to the same remote cache into migrations followed by local cache accesses. To detect sequences suitable for migration, each EM2 core implements a learning migration predictor: a program counter (PC)-indexed, direct-mapped data structure shown in Figure 5-6. In addition to detecting migration-friendly memory references and making a remote-access vs. migration decision for every non-local load and store, our predictor reduces on-chip network traffic by learning and deciding how much of the stack should be transferred for every migrating instruction. The predictor bases these decisions on the instruction's PC. In most programs, sequences of consecutive memory accesses to the same home core, and the context usage patterns within those sequences, are highly correlated with the instructions being executed, and those patterns are fairly consistent and repetitive across program execution. Each predictor has 32 entries, each of which consists of a tag for the PC and the transfer sizes for the main and auxiliary stacks.

Figure 5-6: Integration of the PC-based migration predictor into the stack-based, two-stage pipelined EM2 core. The predictor storage is looked up with the fetch PC (a hit means "migrate", with the specified number of main and auxiliary stack entries sent when migrating from the native core), while a monitoring and feedback module at the execute stage tracks the home core ID, the number of contiguous accesses, and the first PC of each sequence.

Detecting contiguous access sequences. While the detection mechanism is largely similar to the one described in Chapter 3.2.2, we describe it here as well in order to provide a self-contained view of the migration predictor design in the EM2 chip. Initially, the predictor table is empty, and all instructions are predicted to be remote-access. To detect memory access sequences suitable for migration, the predictor tracks how many consecutive accesses to the same remote core have been made and, if this count reaches a (configurable) threshold θ, inserts the PC of the instruction at the start of the sequence into the predictor. To accomplish this, each thread tracks (1) home, which maintains the home location (core ID) for the memory address being requested, (2) depth, which indicates how many times thus far the thread has contiguously accessed the current home location (i.e., the home field), and (3) start PC, which tracks the PC of the first instruction that accessed memory at that home core. As shown in Figure 5-6, these data structures within the migration predictor interface with the execute stage of the core.
When a thread T executes a memory instruction for address A whose PC is P, it must:

1. find the home core H for A (e.g., by masking the appropriate bits);

2. if home = H (i.e., the memory access is to the same home core as the previous memory access),
(a) if depth < θ, increment depth by one;
(b) otherwise, if depth = θ, insert start PC into the predictor table;

3. if home ≠ H (i.e., a new sequence starts with a new home core),
(a) if depth < θ, invalidate any existing entry for start PC in the predictor table (thus making start PC non-migratory);
(b) reset the current sequence tracker (i.e., home ← H, start PC ← P, depth ← 1).

When an instruction is first inserted into the predictor, the stack transfer sizes for the main and auxiliary stacks are set to the default values of 8 (half of the main stack) and 0 entries, respectively.

Figure 5-7: Decision/learning mechanism of the migration predictor: (a) migrating from a native core with the predicted number of main stack entries; (b) migrating from a guest core with all valid stack entries; (c) learning the best context size from stack underflow/overflow; (d) learning from misprediction when fewer than θ memory accesses were made at the migrated-to core.

5.3.2 Partial Context Migration Policy

Migration prediction for memory accesses. The predictor uses the instruction's address (i.e., the PC) to look up the table of migrating sequences. When a load or store instruction attempts, at the execute stage, to access an address that cannot be cached at the core where the thread is currently running (a core miss), the result of the predictor lookup (performed at the fetch stage) is used: if the PC is in the table, the predictor instructs the thread to migrate; otherwise, it performs a remote access. When the predictor instructs a thread to migrate from its native core to another core, it also provides the number of main and auxiliary stack entries that should be migrated (cf. Figure 5-7a). Because the stacks in the guest context are not backed by memory, however, all valid stack entries must be transferred when migrating from a guest context (cf. Figure 5-7b).

Feedback and learning. To learn at runtime how many stack entries to send when migrating from a native context, the native context keeps track of the start PC that caused the last migration. When the thread arrives back at its native core, it reports the reason for its return: if the thread migrated back because of stack overflow (or underflow), the stack transfer size of the corresponding start PC is decremented (or incremented) accordingly (cf. Figure 5-7c). In this case, less (or more) of the stack will be brought along the next time around, eventually reducing the number of unnecessary migrations due to stack overflow and underflow. The returning thread also reports the number of local memory instructions it executed at the core it originally migrated to; if the thread returns without having made θ accesses, the corresponding start PC is removed from the predictor table and the access sequence reverts to remote access (cf. Figure 5-7d).³ This allows the predictor to respond to runtime changes in program behavior.

³ Returns caused by evictions from the remote core do not trigger removal, since the thread might have completed θ accesses had it not been evicted.
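The feedback rules just described can be summarized in a few lines of software; the cause codes and table-update details below are simplifications of the hardware interface (cf. Table 5.1), not its exact behavior. The transfer-size step of two entries and the 2-14 range follow the st1_xfer-size values listed in Table 5.1.

    // Sketch of the predictor's learning-on-return rules (illustrative only).
    #include <algorithm>

    enum ReturnCause { STACK_OVERFLOW, STACK_UNDERFLOW, EVICTION, CORE_MISS };

    struct PredictorEntry { unsigned start_pc; int main_xfer; int aux_xfer; bool valid; };

    void on_return_to_native(PredictorEntry& e, ReturnCause cause,
                             int run_length, int theta) {
        if (cause == STACK_OVERFLOW)            // brought too much: shrink next time
            e.main_xfer = std::max(2, e.main_xfer - 2);
        else if (cause == STACK_UNDERFLOW)      // brought too little: grow next time
            e.main_xfer = std::min(14, e.main_xfer + 2);

        // A sequence that did not reach theta accesses was not worth migrating;
        // evictions are excluded, since the thread might have reached theta otherwise.
        if (run_length < theta && cause != EVICTION)
            e.valid = false;                    // revert this PC to remote access
    }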
5.3.3 Implementation Details

Guided by the goals of simplicity and verification efficiency, we chose to implement one per-core migration predictor shared between the two contexts in each core (native and guest), rather than dual per-core predictors (one for the native context and one for the guest) or per-thread predictors whose state is transferred as part of the migration context. The per-thread predictor scheme was easy to reject because it would have significantly increased the migrated context size and therefore violated our goal of the most efficient thread migration mechanism. The dual-predictor solution, on the other hand, could in theory improve predictions because the two threads running on a core would not "pollute" each other's predictor tables, at the cost of additional area and verification time. Instead, we chose to preserve simplicity and implement a single per-core predictor shared between the native and guest contexts, sizing the predictor tables (32 entries) so that our tests showed no noticeable performance degradation. Table 5.1 shows the interface of the migration predictor in EM2.

Port Name            Direction   Description
CLK                  IN          Clock signal
RSTN                 IN          Reset signal
Fetch stage
lookup-en            IN          High when looking up the predictor to see whether lookup-pc needs to migrate or not.
lookup-pc[31:0]      IN          PC at the fetch stage used for the predictor lookup.
is-em                OUT         0 for remote access; 1 for thread migration.
st1_xfer-size[2:0]   OUT         Migration context size for the primary stack; can vary among 2, 4, 6, 8, 10, 12 and 14 entries.
st2_xfer-size        OUT         Migration context size for the auxiliary stack; 0 for none, 1 for 2 entries.
Execute stage
tracker-en           IN          High for memory instructions to keep track of run lengths.
tracker-reset        IN          Clears the selected tracker (tracker-sel); the tracker must be cleared before a thread migrates out.
tracker-sel          IN          0/1: selects the tracker for the corresponding hardware context (native/guest).
exec-pc[31:0]        IN          PC at the execute stage; updates the tracker (home core and run length).
home-core[6:0]       IN          Home core ID for exec-pc.
threshold[4:0]       IN          Run-length threshold for a PC to be considered EM (migratory); assumed constant throughout execution.
check-native-pc      IN          High only when a thread migrates back to its native core; updates the predictor according to mig-type (either deletes the PC or increments/decrements the transfer context size). For this update, exec-pc holds the PC that caused the thread to migrate out.
mig-type[5:0]        IN          Specifies the cause of the migration back to the native core; each of the six bits corresponds to one cause: ST1 underflow, ST1 overflow, ST2 underflow, ST2 overflow, eviction, and core miss.
run-length[4:0]      IN          The number of memory instructions performed while the thread was away from its native core.

Table 5.1: Interface ports of the migration predictor in EM2

5.4 Physical Design of the EM2 Processor

5.4.1 Overview

Our primary purpose in building the proof-of-concept EM2 chip is to demonstrate the benefit of fine-grained partial context migration for a directoryless architecture and how the technique scales with a large number of cores. As such, our major design goal was to implement our proposed scheme with more than 100 cores. Having a large number of cores is also important since it directly relates to the performance of hardware-level thread migration. This requirement imposed tight constraints on the area and power consumption of each tile, because our total die area was 10 mm x 10 mm and the power budget for the entire chip was limited to around 12W by the number of power pins available for the chip.
Therefore, throughout the physical design process, we focused on reducing area and power consumption rather than on making the processor run at a high clock speed. We also made several decisions to simplify parts of the design that are not at the heart of the proposed architecture we evaluate (e.g., the memory interface, routers, etc.); together with the verification scalability of our design (described in more detail in Chapter 5.6.4), this allowed the entire EM2 chip design and implementation to take only 18 man-months. In terms of CAD tools, we used Synopsys Design Compiler to synthesize the RTL code and Cadence Encounter for placement and routing (P&R). The sign-off timing closure was done with Synopsys PrimeTime static timing analysis (STA).

5.4.2 Tile-level

Figure 5-8 shows the layout of a single EM2 tile. The tile was synthesized with a clock period of 3ns (i.e., targeting a clock frequency of 333 MHz), and the dimensions of the tile are 855um x 917um, resulting in an area of 0.784mm2. As shown in Figure 5-8, the SRAM blocks used for the instruction and data caches take up almost half of the tile area. For the modules other than the SRAMs, we allowed ungrouping during synthesis to maximize area efficiency; as a result, we can observe that the routers are placed along the border of the tile to reduce latency between neighboring tiles. The migration predictor accounts for about 2.6% of the EM2 tile area.

Figure 5-8: EM2 tile layout, showing the core, routers, SRAM blocks and migration predictor.

To reduce power consumption, we first decided to use the high-voltage-threshold (HVT) standard cell library instead of the regular-voltage-threshold (RVT) cells; this reduced the leakage power of the EM2 tile by 38%, and although the HVT cells have slower switching speed, this was not an issue since our target frequency was not high. We also used the automatic clock gating provided by Synopsys Design Compiler, which reduced the dynamic power of the EM2 tile by 67%. The power reduction achieved by each step is shown in Table 5.2.

                       Internal Power (mW)   Switching Power (mW)   Leakage Power (mW)   Total Power (mW)
RVT                    40.8611               1.3481                 32.569               74.7783
HVT                    40.1873               1.3579                 20.465               62.0103
HVT and Clock-gating   13.3454               2.0070                 18.793               34.1448

Table 5.2: Power estimates of the EM2 tile (reported by Design Compiler)

5.4.3 Chip-level

As an effective evaluation of the potential of our migration architecture directed us towards as large a core count as feasible in our 10 mm x 10 mm of silicon, our final taped-out EM2 chip includes 110 tiles laid out in a 2D grid. For design simplicity and verification efficiency, EM2 implements a homogeneous tiled architecture; of the 110 tiles in the EM2 ASIC, 108 are identical, while the remaining two include interfaces to off-chip memory (cf. Figure 5-1). With this hierarchical design, our bottom-up approach allowed us to simply replicate the layout of a single tile for the chip-level design. The sign-off timing closure was done at a clock frequency of 200 MHz.

While resolving setup-time violations was not a big issue for EM2, removing hold-time violations was not trivial; hold-time violations are actually more critical for a chip to function correctly, since they cannot be fixed after fabrication. While CAD tools (e.g., Encounter) commonly solve hold-time violations by inserting delay cells along the data paths, in our design the hold-time violations on the data paths between neighboring routers were not easily removed in this manner because the space between two tiles was too small to accommodate enough delay cells.
Therefore, we instead inserted a negative-edge flip-flop on each of these particular router paths, which automatically adds an extra half clock cycle of delay to the data path and resolves the hold-time violations. The EM2 chip die photo is shown in Figure 5-9.

Figure 5-9: Die photo of the 110-core EM2 chip.

5.5 Evaluation Methods

5.5.1 RTL simulation

To evaluate the EM2 implementation, we chose an idealized cache-coherent baseline architecture with a two-level cache hierarchy (a private L1 data cache and a shared L2 cache). In this scheme, the L2 is distributed evenly among the 110 tiles and the size of each L2 slice is 512KB. An L1 miss results in a cache line being fetched from the L2 slice that corresponds to the requested address (which may be on the same tile as the L1 cache or on a different tile). While this cache fetch request must still traverse the network to the correct L2 slice and bring the cache line back, our cache-coherent baseline is idealized in the sense that, rather than modeling the details of a specific coherence protocol implementation, it does not include a directory and never generates any coherence traffic (such as invalidates and acknowledgements); coherence among caches is ensured "magically" by the simulation infrastructure. While such an idealized implementation is impossible to build in hardware, it represents an upper bound on the performance of any implementable directory coherence protocol, and serves as the ultimate baseline for performance comparisons.

To obtain the on-chip traffic levels and completion times for our architecture, we began with the post-tapeout RTL of the EM2 chip, removed ASIC-specific features such as the scan chains and the modules used to collect various statistics at runtime, and added the same shared-L2 cache hierarchy as the cache-coherent baseline. Since our focus is on comparing on-chip performance, the working set for our benchmarks is sized to fit in the entire shared-L2 aggregate capacity. All of the simulations used the entire 110-core chip RTL; for each benchmark, we report the completion times as well as the total amount of on-chip network traffic (i.e., the number of times any flit traveled across any router crossbar).

The ideal CC simulations run only one thread in each core, and therefore use only the native context. Although the EM2 simulations can use the storage space of both contexts in a given core, this does not increase the parallelism available to EM2: because the two contexts share the same I$ port, only one context can execute an instruction at any given time. Both simulations use the same 8 KB L1 instruction cache as the EM2 chip. Unlike the PC, instruction cache entries are not migrated as part of the thread context; while this might at first appear to be a disadvantage when a thread first migrates to a new core, we have observed that in practice, at steady state, the I$ has usually already been filled (either by other threads or by previous iterations that execute the same instruction sequence), and the I$ hit rate remains high.

5.5.2 Area and power estimates

Area and power estimates were obtained by synthesizing the RTL using Synopsys Design Compiler (DC). For the EM2 version, we used the post-tapeout RTL with the scan chains and statistics modules deleted; we reused the same IBM 45nm SOI process with the ARM sc12 low-power ASIC cell library and SRAM blocks generated by the IBM Memory Compiler. Synthesis targeted a clock frequency of 800MHz and leveraged DC's automatic clock-gating feature.
To give an idea of how these costs compare against those of a well-understood, realistic architecture, we also estimated the area and leakage power of an equivalent design where the data caches are kept coherent via a directory-based MESI protocol (CC). We chose an exact sharer representation (one bit for each of the 110 sharers) and either the same number of entries as in the data cache (CC 100%) or half the entries (CC 50%); in both versions the directory was 4-way set-associative. To estimate the area and leakage power of the directory, we synthesized a 4-way version of the data cache controller from the EM2 chip with SRAMs sized for each directory configuration, using the same synthesis constraints (since a directory controller is somewhat more complex than a cache controller, this approach likely results in a slight underestimate).

For area and leakage power, we report the synthesis estimates computed by DC. While all of these quantities typically change somewhat post-layout (because of factors like routing congestion or buffers inserted to avoid hold-time violations), we believe that synthesis results are sufficient to make architectural comparisons. Dynamic power dominates the power signature, but it is highly dependent on the specific benchmark, and obtaining accurate estimates for all of our benchmarks is not practical. Instead, we observe that for the purposes of comparing EM2 to the baseline architecture, it suffices to focus on the differences, which consist of (a) the additional core context, (b) the migration predictor, and (c) differences in cache and network accesses. The first two are insignificant: our implementation allows only one of the EM2 core contexts to be active in any given cycle, so even though the extra context adds leakage, dynamic power remains constant. The migration predictor is a small part of the tile and does not add much dynamic power according to our analysis. Since we ran the same programs and pre-initialized the caches, the cache accesses were the same, meaning an equal contribution to dynamic power. The only significant difference is in the dynamic network power, which is directly proportional to the on-chip network traffic (i.e., the number of network flits sent times the distance traveled by each flit); we therefore report this for all benchmarks as a proxy for dynamic power.

5.6 Evaluation

5.6.1 Performance tradeoff factors

To precisely understand the conditions under which fast thread migration results in improved performance, we created a simple parameterized benchmark that executes a sequence of loads to memory assigned to a remote L2 slice. There are two parameters: the run length is the number of contiguous accesses made to the given address range, and cache misses is the number of L1 misses these accesses induce (in other words, this determines the stride of the access sequence); we also varied the on-chip distance between the tile where the thread originates and the tile whose L2 caches the requested addresses.
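To make the two parameters concrete, the following is a minimal sketch of what such a microbenchmark loop might look like. The buffer name remote_data, the cache-line size, and the stride computation are assumptions made for illustration; they are not the actual benchmark source.

    #include <stdint.h>

    /* Sketch only: RUN_LENGTH contiguous loads are issued to an address range
     * mapped to a remote L2 slice; the stride controls how many of those loads
     * miss in the L1 (the "cache misses" parameter). */
    #define CACHE_LINE_WORDS 8          /* assumed 32-byte lines, 4-byte words */
    #define RUN_LENGTH       8          /* contiguous accesses per visit       */

    extern volatile uint32_t remote_data[];  /* assumed to live on a remote tile */

    uint32_t microbench(int cache_misses)
    {
        /* Choose a stride so that RUN_LENGTH accesses touch `cache_misses`
         * distinct cache lines: stride 1 gives one miss, a full-line stride
         * makes every access miss. */
        int stride = (cache_misses * CACHE_LINE_WORDS) / RUN_LENGTH;
        if (stride == 0)
            stride = 1;

        uint32_t sum = 0;
        for (int i = 0; i < RUN_LENGTH; i++)
            sum += remote_data[i * stride];  /* remote access, or migrate and load locally */
        return sum;
    }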
Figure 5-10: Thread migration (EM2) vs. remote access (RA): completion time (cycles) and network traffic (flit×hops) as a function of run length, for RA-only and for EM2 with migrated contexts of 4, 8, and 12 stack entries.

Figure 5-10 shows how a program that only makes remote cache accesses (RA-only) compares with a program that migrates to the destination core 4 hops away, makes the memory accesses, and returns to the core where it originated (EM2), where the migrated context size is 4, 8, and 12 stack entries (EM2-4, EM2-8, and EM2-12). Since the same L1 cache is always accessed (locally or remotely), both versions result in exactly the same L1 cache misses, and the only relevant parameter is the run length. For a singleton access (run length = 1), RA is slightly faster than any of the migration variants because the two migration packets involved are longer than the RA request/response pair, and, for the same reason, induce much more network traffic. For multiple accesses, however, the single migration round-trip followed by local cache accesses performs better than the multiple remote cache access round-trips, and the advantage of the migration-based solution grows as the run length increases.

Figure 5-11: Thread migration (EM2) vs. private caching (CC): completion time (cycles) and network traffic (flit×hops) as a function of the number of cache misses, for CC-ideal and for EM2 with 4, 8, and 12 migrated stack entries.

The tradeoff against our "ideal cache coherence" private-cache baseline (CC) is less straightforward than against RA: while CC will still make a separate request to load every cache line, subsequent accesses to the same cache line will result in L1 cache hits and no network traffic. Figure 5-11 illustrates how the performance of CC and EM2 depends on how many times the same cache line is reused in 8 accesses. When all 8 accesses are to the same cache line (cache misses = 1), CC requires one round-trip to fetch the entire cache line, and is slightly faster than EM2, which needs to unload the thread context, transfer it, and load it in the destination core. Once the number of misses grows, however, the multiple round-trips required in CC become more costly than the context load/unload penalty of the one round-trip migration, and EM2 performs better. And in all cases, EM2 induces less on-chip network traffic: even in the one-miss case where CC is faster, the thread context that EM2 has to migrate is often smaller than the CC request plus the cache line that is fetched.

Figure 5-12: The effect of distance (number of hops) on the completion time of RA-only, CC-ideal, and EM2-8 (run length = 8, cache misses = 2).

Finally, Figure 5-12 examines how the three schemes are affected by the on-chip distance between the core where the thread originates and the core that caches the requested data (with run length = 8 and cache misses = 2). RA, which requires a round-trip access for every word, grows the fastest (i.e., eight round-trips), while CC, which only needs a round-trip cache line fetch for every L1 miss (i.e., two round-trips), grows much more slowly. Because EM2 only requires one round-trip for all accesses, the distance traveled is not a significant factor in its performance.

5.6.2 Benchmark performance

Figure 5-13 shows how the performance of EM2 compares to the ideal CC baseline for several benchmarks.
These include: (1) single-threaded memcpy in next-neighbor (near) and cross-chip (far) variants, (2) parallel k-fold cross-validation (par-cv), a machine learning technique that uses stochastic gradient learning to improve model accuracy, (3) 2D Jacobi iteration (jacobi), a widely used algorithm for solving partial differential equations, and (4) partial table scan (tbscan), which executes queries that scan through a part of a globally shared data table distributed among the cache shards. We first note some overall trends and then discuss each benchmark in detail below.

Figure 5-13: The evaluation of EM2: (a) completion time and (b) network traffic of CC-ideal, RA-only, and EM2, normalized to CC-ideal, for memcpy-near, memcpy-far, par-cv, jacobi, tbscan-16, and tbscan-110.

Overall remarks. Figure 5-13 illustrates the overall performance (i.e., completion time) and on-chip network traffic of the ideal directory-based baseline (CC), the remote-access-only variant (RA), and the EM2 architecture. Overall, EM2 always outperforms RA, offering up to a 3.9x reduction in run time, and performs as well as or better than CC in all cases except one. Throughout, EM2 also offers significant reductions in on-chip network traffic, up to 42x less traffic than CC for par-cv.

Figure 5-14: Thread migration statistics under EM2: (a) the number of migrations per thousand instructions, broken down into data-access, eviction, and stack over/underflow migrations, and (b) thread migration performance in EM2 (average migration latency in cycles, and average migration size as a percentage of the full context).

Migration rates, shown in Figure 5-14a, range from 0.2 to 20.9 migrations per 1,000 instructions depending on the benchmark. These quantities justify our focus on efficient thread movement: if migrations occur at the rate of nearly one in every hundred to thousand instructions, taking 1,000+ cycles to move a thread to a different core would indeed incur a prohibitive performance impact. Most migrations are caused by data accesses, with stack under/overflow migrations at a negligible level and evictions significant only in the tbscan benchmarks. Even with many threads, effective migration latencies are low (Figure 5-14b, bars), with the effect of distance clearly seen for the near and far variants of memcpy; the only exception here is par-cv, in which the migration latency is a direct consequence of delays due to inter-thread synchronization (as we explain below). At the same time, migration sizes (Figure 5-14b, line) vary significantly, and stay well below the 60% mark (44% on average): since most of the on-chip traffic in the EM2 case is due to migrations, forgoing partial-context migration support would have significantly increased the on-chip traffic (cf. Figure 5-13b).

Memory copy. The memcpy-near and memcpy-far benchmarks copy 32 KB (the size of an L1 data cache) from a memory address range allocated to a next-neighbor tile (memcpy-near) or to a tile at the maximum distance across the 110-core chip (memcpy-far); a sketch of the copy kernel is shown below.
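The following is a minimal sketch of such a copy kernel, written here only to make the access pattern concrete. The buffer names and their placement are assumptions; on EM2, data placement follows from the address-to-tile mapping rather than from any special API.

    #include <stdint.h>
    #include <stddef.h>

    /* Sketch only: a single thread copies 32 KB (one L1 data cache's worth)
     * from a source buffer mapped to a remote tile into a destination buffer
     * mapped to the thread's home tile. */
    #define COPY_BYTES (32 * 1024)

    extern uint32_t src[COPY_BYTES / 4];   /* assumed to map to a neighboring (near)
                                              or maximally distant (far) tile       */
    extern uint32_t dst[COPY_BYTES / 4];   /* assumed to map to the home tile        */

    void memcpy_bench(void)
    {
        for (size_t i = 0; i < COPY_BYTES / 4; i++)
            dst[i] = src[i];   /* each load is a remote access or a migrate-and-load;
                                  each store is local to the home tile               */
    }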
In both cases, EM2 is able to repeatedly migrate to the source tile, load up a full thread context's worth of data, and migrate back to store the data at the destination addresses; because the maximum context size exceeds the cache line size that ideal CC fetches, EM2 has to make fewer trips and performs better both in terms of completion time and network traffic. Distance is a significant factor in performance (the fewer round-trips of EM2 make a bigger difference when the source and destination cores are far apart) but does not change the percentage improvement in network traffic, since that is determined by the total amount of data transferred under EM2 and CC.

Partial table scan. In this benchmark, random SQL-like queries are assigned to separate threads, and the table that is searched is distributed in equal chunks among the per-tile L2 caches. We show two variants: a light-load version where only 16 threads are active at a time (tbscan-16) and a full-load version where all of the 110 available threads execute concurrently (tbscan-110). Under light load, EM2 finishes slightly faster than CC-ideal and significantly reduces network traffic (2.9x), while under full load EM2 is 1.8x slower than CC-ideal and generates the same level of network traffic. Why such a large difference? Under light load, EM2 takes full advantage of data locality, which allows it to significantly reduce on-chip network traffic, but it performs only slightly better than CC-ideal because queries that access the same data chunks compete for access to the same core and effectively serialize some of the computation. Because the queries are random, this effect grows as the total number of threads increases (Figure 5-15), resulting in very high thread eviction rates under full load (Figure 5-14a); this introduces additional delays and network traffic as threads ping-pong between their home core and the core that caches the data they need.

Figure 5-15: Completion time and network traffic (normalized) for tbscan under EM2 as the number of threads grows from 1 to 110, including the EM2-N10 and EM2-N100 variants.

Figure 5-16: Under EM2, a guest thread is allowed to execute N instructions before it can be evicted from a guest context.

This ping-pong effect, and the associated on-chip traffic, can be reduced by guaranteeing that each thread can perform N (configurable in hardware) instructions before being evicted from a guest context, as illustrated in Figure 5-16. Figure 5-15 illustrates how tbscan performs when N = 10 and N = 100: a longer guaranteed guest-context occupation time results in up to 2x reductions in network traffic at the cost of a small penalty in completion time due to the increased level of serialization. This highlights an effective tradeoff between performance and power: with more serialization, EM2 can use far less dynamic power due to on-chip network traffic (and because fewer cores are actively computing) if the application can tolerate lower performance.

Parallel k-fold cross validation. As previously described in Chapter 3.3.1, parallel k-fold cross-validation runs k independent leave-one-out experiments, where each experiment requires the entire set of data samples. Since the model used in each experiment is necessarily sequential for sequential machine learning algorithms, each experiment naturally maps to a thread; this is the natural form of parallelization (see the sketch below).
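The following is a minimal sketch of this one-experiment-per-thread structure. The types and helper names (chunk_t, model_t, learn_from_chunk, K, NUM_CHUNKS) and the chunk layout are assumptions made for this sketch, not the actual benchmark code.

    #include <pthread.h>

    #define K          110
    #define NUM_CHUNKS K

    typedef struct { /* ... per-chunk data samples ... */ int dummy; } chunk_t;
    typedef struct { /* ... model parameters ...        */ int dummy; } model_t;

    extern chunk_t chunks[NUM_CHUNKS];   /* spread across the per-tile shared caches */
    extern void learn_from_chunk(model_t *m, const chunk_t *c);

    static model_t models[K];

    static void *cv_experiment(void *arg)
    {
        long id = (long)arg;
        /* Each experiment walks the chunks in order, skipping its held-out chunk;
         * every chunk is accessed many times in a row, giving a long run length. */
        for (int c = 0; c < NUM_CHUNKS; c++) {
            if (c == id)
                continue;
            learn_from_chunk(&models[id], &chunks[c]);
        }
        return 0;
    }

    void run_par_cv(void)
    {
        pthread_t tid[K];
        for (long i = 0; i < K; i++)
            pthread_create(&tid[i], 0, cv_experiment, (void *)i);
        for (int i = 0; i < K; i++)
            pthread_join(tid[i], 0);
    }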
The data samples are split into k data chunks, which are typically spread across the shared caches; since each experiment repeatedly accesses a given chunk before moving on to the next one, it has a fairly high run length, which favors EM2. With an overall completion time slightly better under EM2 than under CC-ideal and much better than under RA-only, par-cv stands out for its 42x reduction in on-chip network traffic vs. CC-ideal (96x vs. RA). This is because the cost of every migration is amortized over a large number of local cache accesses on the destination core (as the algorithm learns from the given data chunk), while CC-ideal continuously fetches more data to feed the computation. Completion time for par-cv, however, is only slightly better because of the nearly 200-cycle average migration times at full 110-thread utilization (Figure 5-14b). This is due to a serialization effect similar to that in tbscan: a thread that has finished learning on a given chunk and migrates to proceed to the next chunk must sometimes wait en route while the previous thread finishes processing that chunk. Unlike tbscan, however, where the contention results from random queries, the threads in par-cv process the chunks in order, and avoid the penalties of eviction. As a result, at the same full utilization of 110 threads, par-cv has a better completion time under EM2, whereas tbscan performs better under CC. (At lower utilization, the average migration latency of par-cv falls: e.g., at 50 threads it becomes 9 cycles, making the EM2 version 11% faster than CC.)

2D Jacobi iteration. In essence, the jacobi benchmark propagates a computation through a matrix, and so the communication it incurs is between the boundary of the 2D matrix region stored in the current core and its immediate neighbors stored in the adjacent cores. Since the data accesses are largely to a thread's own private region, inter-core data transfers are a negligible factor in the overall completion time, and the runtime for all three architectures is approximately the same. In the naive form, the local elements are computed one by one, and all of the memory accesses to remote cores become one-off accesses; in this case, the predictor never instructs threads to migrate and EM2 behaves the same as the RA-only baseline. By using loop unrolling, however, the performance of EM2 can be improved: multiple remote loads are now performed contiguously, meaning that a thread migrates with a few addresses for loads and migrates back with its stack filled with multiple load results (see Figure 5-17). In this manner, EM2 is able to reduce the overall network traffic because it can amortize the costs of migrating by consecutively accessing many matrix elements in the boundary region, while CC-ideal has to access this data with several L2 fetches. While unrolling does not change the performance under the RA regime, it allows EM2 to incur 31% less network traffic than RA.

Figure 5-17: Unlike RA, which turns each load into a separate remote round-trip, EM2 allows efficient bulk loads from a remote core.

5.6.3 Area and power costs

Since the CC-ideal baseline we use for the performance evaluation above does not have directories, it does not make a good baseline for an area and power comparison. Instead, we estimated the area required for MESI implementations with the directory sized to 100% and 50% of the total L1 data cache entries, and compared the area and leakage power to those of EM2.
The L2 cache hierarchy, which was added for a more realistic performance evaluation and is not a part of the actual chip, is not included here for either EM2 or CC. Table 5.3 summarizes the architectural components that differ. EM2 requires an extra architectural context (for the guest thread) and on-chip networks for migrations and evictions as well as for RA requests and responses. Our EM2 implementation also includes a learning migration predictor; while this is not strictly necessary in a purely instruction-based migration design, it offers runtime performance advantages similar to those of a hardware branch predictor. In comparison, a deadlock-free implementation of MESI would replace the four migration and remote-access on-chip networks with three (for coherence requests, replies, and invalidations), implement the D$ controller logic required to support the coherence protocol, and add the directory controller and associated SRAM storage.

                                             EM2    CC
  extra execution context in the core        yes    no
  migration predictor logic & storage        yes    no
  remote cache access support in the D$      yes    no
  coherence protocol logic in the D$         no     yes
  coherence directory logic & storage        no     yes
  number of independent on-chip networks     6      5

Table 5.3: A summary of the architectural costs that differ in the EM2 and CC implementations.

Figure 5-18: Relative area and leakage power costs of EM2 vs. estimates for exact-sharer CC with the directory sized to 100% and 50% of the D$ entries, broken down into routers, D$ slice, I$, directory slice, predictor, and core (DC Ultra, IBM 45nm SOI HVT library, 800 MHz).

Figure 5-18 shows how the silicon area and leakage power compare. Not surprisingly, blocks with significant SRAM storage (the instruction and data caches, as well as the directory in the CC version) were responsible for most of the area in all variants. Overall, the extra thread context and extra router present in EM2 were outweighed by the area required for the directory in both the 50% and 100% versions of MESI, which suggests that EM2 may be an interesting option for area-limited CMPs.

5.6.4 Verification Complexity

With evolving VLSI technology and increasing design complexity, verification costs have become more critical than ever. Increasing core counts only make the problem worse, because pairwise interactions among cores result in a combinatorial explosion of the state space as the number of cores grows. Distributed cache coherence protocols in particular are notoriously complex and difficult to design and verify. The response to a given request is determined by the state of all actors in the system (for example, when one cache requests write access to a cache line, any cache containing that line must be sent an invalidate message); moreover, the indirections involved and the nondeterminism inherent in the relative timing of events require a coherence protocol implementation to introduce many transient states that are not explicit in the higher-level protocol. This causes the number of actual states in even relatively simple protocols (e.g., MSI, MESI) to explode combinatorially [3], and
results in complex cooperating state machines driving each cache and directory [39]. In fact, one of the main sources of bugs in such protocols is reachable transient states that are missing from the protocol definition, and fixing them often requires non-trivial modifications to the high-level specification. To make things worse, the many transient states make it difficult to write well-defined testbench suites: with multiple threads running in parallel on multicores, writing high-level applications that exercise all the reachable low-level transient states (or even enumerating those states) is not an easy task. Indeed, descriptions of more optimized protocols can be so complex that they take experts months to understand, and bugs can result from specification ambiguities as well as implementation errors [35]. Significant modeling simplifications must be made to make exploring the state space tractable [1], and even formally verifying a given protocol on a few cores gives no confidence that it will work on 100.

While design and verification complexity is difficult to quantify and compare, both the remote-access-only baseline and the full EM2 system we implemented have a significant advantage over directory cache coherence: a given memory address may only be cached in a single place. This means that any request, remote or local, depends only on the validity of a given line in a single cache, and no indirections or transient states are required. The VALID and DIRTY flags that together determine the state of a given cache line are local to the tile and cannot be affected by state changes in other cores. The thread migration framework does not introduce additional complications, since the data cache does not care whether a local memory request comes from a native thread or a migrated thread: the same local data cache access interface is used. The overall correctness can therefore be cleanly separated into (a) the remote access framework, (b) the thread migration framework, (c) the cache that serves the memory request, and (d) the underlying on-chip interconnect, all of which can be reasoned about separately. This modularity makes the EM2 protocols easy to understand and reason about, and enabled us to safely implement and verify modules in isolation and integrate them afterwards without triggering bugs at the module or protocol levels (cf. Figure 5-19).

Figure 5-19: Bottom-up verification methodology of EM2: bugs are found and fixed within each module (core, cache, router, migration), then at the inter-module level in a single tile, and then at the inter-tile level in a 4-tile system; no new bugs were introduced by scaling up to the 110-tile system.

The homogeneous tiled architecture we chose for EM2 allowed us to significantly reduce verification time by first integrating the individual tiles in a 4-tile system. This resulted in far shorter simulation times than would have been possible with the 110 cores, and allowed us to run many more test programs. At the same time, the 4-tile arrangement exercised all of the inter-tile interfaces, and we found no additional bugs when we switched to verifying the full 110-core system, as shown in Figure 5-19.
Unlike directory entries in directory-based coherence designs, EM2 cores never store information about more than the local core, and all of the logic required for the migration framework (the decision whether to migrate or execute a remote cache access, the calculation of the destination core, serialization and deserialization of network packets from/to the execution context, evicting a running thread if necessary, etc.) is local to the tile. As a result, it was possible to exercise the entire state space in the 4-tile system; perhaps more significantly, this also means that the system could be scaled to an arbitrary number of cores without incurring an additional verification burden.

5.7 Chapter Summary

In this chapter, we have presented the 110-core EM2 chip, a silicon implementation of a directoryless architecture using thread migration and remote access. By employing a stack-based architecture, EM2 minimizes thread migration costs and elegantly supports partial context migration; the taped-out chip also supports on-line learning and prediction, via the migration predictor, of when to migrate and what part of the context to send upon migration. Through RTL simulation, we demonstrate that EM2 can improve performance and reduce network traffic compared to the remote-access-only design, and, for some benchmarks, compared to the cache-coherent baseline as well. Moreover, since EM2 is built on top of a directoryless memory substrate, it provides shared memory without the need for a coherence protocol or directories, offsetting the area overhead of the migration framework while reducing verification complexity at the same time.

Chapter 6

Conclusions

6.1 Thesis contributions

In conventional manycore CMPs with private caches, whenever a thread needs data mapped to remote shared cache slices, the data must be brought to the core where the thread is running. When a thread repeatedly accesses data in remote caches, this incurs large delays and significant network traffic. Furthermore, such private caches must maintain cache coherence to support shared memory, which is typically achieved by a complex coherence protocol and distributed directories.

In this thesis, we first proposed a directoryless architecture that uses thread migration and remote access to access remotely mapped data. Since we do not allow cache line replication across on-chip caches, coherence is trivially ensured without the need for directories (and thus we call it a directoryless architecture). At the same time, we use our fine-grained thread migration mechanism to complement remote word accesses in order to better exploit data locality under such an architecture. However, we observed that high migration costs make it critical to use thread migrations judiciously. Therefore, we have developed an on-line, PC-based migration predictor which decides between a remote access and a thread migration at instruction granularity. Moreover, we extended the migration predictor to support partial context thread migration by learning and predicting the necessary thread context at runtime, which further reduces migration costs.

To validate our proposed architecture, we implemented a 110-core Execution Migration Machine (EM2) processor using a 45nm ASIC technology. This thesis discusses the design and physical implementation details of our prototype chip, which adopts a stack-based core architecture, and also provides detailed evaluation results using RTL-level simulation.
Our results show that our proposed architecture with the migration predictor can improve performance and significantly reduce network traffic compared to a remote-access-only architecture. We have also demonstrated that, for certain applications, our proposed design can outperform or match directory-based coherence with less on-chip traffic and reduced verification complexity. Given that the architecture requires no directories or complicated coherence protocols and that, unlike directory-based coherence protocols, its verification scope does not grow with the number of cores, we believe that our approach provides an interesting design point on the hardware coherence spectrum for many-core CMPs.

6.2 Architectural assumptions and their implications

While the architecture proposed in this dissertation assumes in-order, single-issue cores for the underlying hardware, modern processors often have more complex cores for higher performance. Here, we discuss the requirements and limitations of our proposed scheme on such complex cores, as well as the performance implications for parallel workloads with heterogeneous threads.

Multiple outstanding memory accesses. Under single-issue in-order cores, a thread will not execute a memory instruction until its previous memory instruction completes; a thread can, therefore, start migrating without extra waiting. On the other hand, if multiple outstanding memory accesses are allowed (e.g., in superscalar out-of-order cores), a thread could have multiple remote accesses in flight at the time when it wishes to migrate under our directoryless architecture. In order to provide functional correctness, therefore, the migration hardware needs to ensure that all the responses have been received, i.e., that there are no outstanding remote accesses, before a thread migration can actually happen. While this constraint is sufficient for correct execution, it can affect the migration decision mechanism. Since multiple remote accesses can now be interleaved, hiding their latency, a longer run length than in single-issue cores might be needed to make the cost of thread migration worthwhile. In addition, it might also be beneficial to relax the notion of run length from the number of consecutive accesses to the number of most frequent accesses to the same core, because even the same sequence of memory instructions can execute in a different order at runtime. In terms of network traffic reduction, it is important to note that interleaving multiple memory accesses does not help; migrating a thread can still reduce overall on-chip traffic.

Deeply pipelined cores. While we assume a five-stage pipelined core in this thesis, modern CPUs running at GHz frequencies often have deeper pipelines with more than ten stages. While no architectural changes are required for our design to specifically support deeply pipelined cores, the pipeline depth affects the cost of thread migration, since a thread needs to re-execute from the beginning of the pipeline after migrating to another core (cf. Chapter 2.4). We have, however, observed that increasing the pipeline depth has a negligible effect on the overall performance due to the low migration rates.

Workloads with heterogeneous threads. A dominant class of multithreaded programs runs parallel worker threads with almost identical instruction streams (i.e., the threads execute very similar instructions, although on different data).
While our architecture is not restricted to any specific applications, such high instruction similarity between threads keeps the performance overhead due to extra I-cache misses reasonably low. In addition, thread interference in the migration predictor is minimal for the same reason, and thus not migrating the predictor contents with a thread and allowing a per-core predictor to be shared among threads are sufficiently efficient in terms of performance. There exists, however, another type of parallel workload, where each thread executes different instructions. For example, streaming applications can assign threads to each pipeline stage to exploit pipeline parallelism; migrating a thread in such an application would result in a higher performance overhead because most of the necessary instructions are likely to be refetched into the instruction cache at the core to which the thread has migrated. Possible future solutions to address this overhead include taking the I-cache miss penalty into account when deciding whether to migrate, and/or sending at least one or two instruction cache blocks along with the thread context, which could minimize the performance overhead (especially when used with instruction prefetch hardware) at the cost of a larger thread migration.

6.3 Future avenues of research

Since no automatic data replication is allowed under our proposed directoryless architecture, it can limit the hardware's ability to take advantage of available parallelism, limiting performance benefits. We believe more ways to avoid this limitation, such as implementing thread migration on top of simplified hardware coherence or software coherence, can be explored. While this thesis focuses on using the migration infrastructure to accelerate remote data access and reduce network traffic for a directoryless shared-memory architecture, we view fine-grained partial-context thread migration as an enabling technology suitable for many applications. Being fast and efficient, thread migrations can happen far more frequently than conventional migration schemes (e.g., OS/software-level migration) could support due to their high costs. We believe that investigating the possible applications of fine-grained migration can lead to further research.

Bibliography

[1] Dennis Abts, Steve Scott, and David J. Lilja. So many states, so little time: Verifying memory coherence in the Cray X1. In PDP, 2003.

[2] Adapteva. Startup has big plans for tiny chip technology. In Wall Street Journal, 2011.

[3] Arvind, Nirav Dave, and Michael Katelman. Getting formal verification into design flow. In FM2008, 2008.

[4] M. Awasthi, K. Sudan, R. Balasubramonian, and J. Carter. Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches. In HPCA, 2009.

[5] Moshe (Maury) Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazelwood, Aamer Jaleel, Chi-Keung Luk, Gail Lyons, Harish Patil, and Ady Tal. Analyzing Parallel Programs with Pin. Computer, 43, 2010.

[6] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. In MICRO, 2004.

[7] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, Liewei Bao, J. Brown, M. Mattina, Chyi-Chang Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook. TILE64 Processor: A 64-Core SoC with Mesh Interconnect. In Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International, Feb 2008.
[8] Shekhar Borkar. Thousand core chips: A technology perspective. In Proceedings of the 44th Annual Design Automation Conference, DAC '07, pages 746-749, New York, NY, USA, 2007. ACM.

[9] Silas Boyd-Wickizer, Robert Morris, and M. Frans Kaashoek. Reinventing scheduling for multicore systems. In HotOS, 2009.

[10] Jeffery A. Brown and Dean M. Tullsen. The shared-thread multiprocessor. In ICS, 2008.

[11] Koushik Chakraborty, Philip M. Wells, and Gurindar S. Sohi. Computation spreading: employing hardware migration to specialize CMP cores on-the-fly. In ASPLOS, 2006.

[12] Jichuan Chang and Gurindar S. Sohi. Cooperative caching for chip multiprocessors. In ISCA, 2006.

[13] M. Chaudhuri. PageNUCA: Selected policies for page-grain locality management in large shared chip-multiprocessor caches. In HPCA, 2009.

[14] Myong Hyon Cho, Keun Sup Shim, Mieszko Lis, Omer Khan, and Srinivas Devadas. Deadlock-free fine-grained thread migration. In NOCS, 2011.

[15] Sangyeun Cho and Lei Jin. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. In MICRO, 2006.

[16] Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, and Ching-Tsun Chou. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism. In PACT, 2011.

[17] Tilera Corporation. Tilera announces TILE-Gx72, the world's highest performance and highest-efficiency manycore processor. In Tilera Press Release, Feb 2013.

[18] Blas A. Cuesta, Alberto Ros, Maria E. Gómez, Antonio Robles, and José F. Duato. Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, pages 93-104, New York, NY, USA, 2011.

[19] William J. Dally and Brian Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2003.

[20] Socrates Demetriades and Sangyeun Cho. Stash directory: A scalable directory for many-core coherence. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, Feb 2014.

[21] R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc. Design of ion-implanted MOSFETs with very small physical dimensions. Solid-State Circuits, IEEE Journal of, 9(5):256-268, Oct 1974.

[22] A. DeOrio, A. Bauserman, and V. Bertacco. Post-silicon verification for cache coherence. In ICCD, 2008.

[23] C. Fensch and M. Cintra. An OS-based alternative to full hardware coherence on tiled CMPs. In HPCA, 2008.

[24] M. Ferdman, P. Lotfi-Kamran, K. Balet, and B. Falsafi. Cuckoo directory: A scalable directory for many-core systems. In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pages 169-180, Feb 2011.

[25] International Technology Roadmap for Semiconductors. 2012 Update Overview, 2012.

[26] H. Garcia-Molina, R. J. Lipton, and J. Valdes. A Massive Memory Machine. IEEE Trans. Comput., C-33, 1984.

[27] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In International Conference on Parallel Processing, pages 312-321, 1990.

[28] Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. Reactive NUCA: near-optimal block placement and replication in distributed caches. In ISCA, 2009.

[29] Rajeeb Hazra. The explosion of petascale in the race to exascale.
International Supercomputing Conference, 2012.

[30] Rajeeb Hazra. Driving industrial innovation on the path to exascale: From vision to reality. International Supercomputing Conference, 2013.

[31] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, and T. Mattson. A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS. In Solid-State Circuits Conference, 2010. ISSCC 2010. Digest of Technical Papers. IEEE International, February 2010.

[32] Wilson C. Hsieh, Paul Wang, and William E. Weihl. Computation migration: enhancing locality for distributed-memory parallel systems. In PPoPP, 1993.

[33] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. W. Keckler. A NUCA substrate for flexible CMP cache sharing. In ICS, 2005.

[34] Jose A. Joao, M. Aater Suleman, Onur Mutlu, and Yale N. Patt. Bottleneck identification and scheduling in multithreaded applications. In ASPLOS, 2012.

[35] Rajeev Joshi, Leslie Lamport, John Matthews, Serdar Tasiran, Mark Tuttle, and Yuan Yu. Checking cache-coherence protocols with TLA+. Formal Methods in System Design, 22:125-131, 2003.

[36] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. In ASPLOS, 2002.

[37] Theodoros Konstantakopulos, Jonathan Eastep, James Psota, and Anant Agarwal. Energy scalability of on-chip interconnection networks in multicore architectures. MIT-CSAIL-TR-2008-066, 2008.

[38] Amit Kumar, Partha Kundu, Arvind Singh, Li-Shiuan Peh, and Niraj K. Jha. A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in 65nm CMOS. In ICCD, 2008.

[39] Daniel E. Lenoski and Wolf-Dietrich Weber. Scalable Shared-Memory Multiprocessing. Morgan Kaufmann, 1995.

[40] Mieszko Lis, Keun Sup Shim, Myong Hyon Cho, Christopher W. Fletcher, Michel Kinsy, Ilia Lebedev, Omer Khan, and Srinivas Devadas. Brief announcement: Distributed shared memory based on computation migration. In SPAA, 2011.

[41] Mieszko Lis, Keun Sup Shim, Myong Hyon Cho, Omer Khan, and Srinivas Devadas. Directoryless shared memory coherence using execution migration. In PDCS, 2011.

[42] P. Michaud. Exploiting the cache capacity of a single-chip multicore processor with execution migration. In HPCA, 2004.

[43] Jason E. Miller, Harshad Kasture, George Kurian, Charles Gruenwald, Nathan Beckmann, Christopher Celio, Jonathan Eastep, and Anant Agarwal. Graphite: A distributed parallel simulator for multicores. In HPCA, 2010.

[44] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8), April 1965.

[45] George Nychis, Chris Fallin, Thomas Moscibroda, and Onur Mutlu. Next generation on-chip networks: what kind of congestion control do we need? In Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks, page 12. ACM, 2010.

[46] J. D. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler, and Li-Shiuan Peh. Research challenges for on-chip interconnection networks. Micro, IEEE, 27(5):96-108, Sept 2007.

[47] Michael D. Powell, Arijit Biswas, Shantanu Gupta, and Shubhendu S. Mukherjee. Architectural core salvaging in a multi-core processor for hard-error tolerance. In ISCA, 2009.

[48] Adapteva Products. Epiphany-IV 64-core 28nm microprocessor, 2012.

[49] Krishna K.
Rangan, Gu-Yeon Wei, and David Brooks. Thread motion: Fine-grained power management for multi-core systems. In ISCA, 2009.

[50] Stefan Rusu, Simon Tam, Harry Muljono, Jason Stinson, David Ayers, Jonathan Chang, Raj Varada, Matt Ratta, and Sailesh Kottapalli. A 45nm 8-core enterprise Xeon processor. In ISSCC, pages 56-57. IEEE, 2009.

[51] D. Sanchez and C. Kozyrakis. SCD: A scalable coherence directory with flexible sharer set encoding. In High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, pages 1-12, Feb 2012.

[52] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Doug Burger, Stephen W. Keckler, and Charles R. Moore. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In Proceedings of the 30th Annual International Symposium on Computer Architecture, ISCA '03, pages 422-433, New York, NY, USA, 2003. ACM.

[53] Keun Sup Shim, Mieszko Lis, Myong Hyon Cho, Ilia Lebedev, and Srinivas Devadas. Design Tradeoffs for Simplicity and Efficient Verification in the Execution Migration Machine. In Proceedings of the Int'l Conference on Computer Design, October 2013.

[54] Keun Sup Shim, Mieszko Lis, Omer Khan, and Srinivas Devadas. Thread migration prediction for distributed shared caches. Computer Architecture Letters, Sep 2012.

[55] Angela C. Sodan. Message-Passing and Shared-Data Programming Models - Wish vs. Reality. In High Performance Computing Systems and Applications, 2005. HPCS 2005. 19th International Symposium on, pages 131-139, May 2005.

[56] B. Stackhouse, B. Cherkauer, M. Gowan, P. Gronowski, and C. Lyles. A 65nm 2-billion-transistor quad-core Itanium processor. In Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International, pages 92-598, Feb 2008.

[57] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar. An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS. IEEE J. Solid-State Circuits, 43:29-41, 2008.

[58] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it all to Software: Raw Machines. In IEEE Computer, pages 86-93, September 1997.

[59] David Wentzlaff, Patrick Griffin, Henry Hoffmann, Liewei Bao, Bruce Edwards, Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John F. Brown III, and Anant Agarwal. On-Chip Interconnection Architecture of the Tile Processor. IEEE Micro, 27:15-31, September 2007.

[60] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: characterization and methodological considerations. In ISCA, 1995.

[61] D. Yeh, Li-Shiuan Peh, S. Borkar, J. Darringer, A. Agarwal, and Wen-Mei Hwu. Thousand-core chips [roundtable]. Design Test of Computers, IEEE, 25(3):272-278, May 2008.

[62] M. Zhang and K. Asanovic. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. In ISCA, 2005.

[63] Meng Zhang, Alvin R. Lebeck, and Daniel J. Sorin. Fractal coherence: Scalably verifiable cache coherence. In MICRO, 2010.

Appendix A

Source-level Read-only Data Replication

The directoryless design, which is used for both the remote-access-only baseline and our proposed architecture, does not allow replication of any kind of data at the hardware level.
Read-only data, however, can actually be replicated without breaking cache coherence, even without directories and a coherence protocol. While we believe that the detection and replication of such data can be done fairly straightforwardly by a compiler, automating the replication of these data is outside the scope of this thesis. Instead, we achieve this strictly at the source level: globally shared read-only data can be replicated permanently, and similarly, data that is read-only in a limited scope (although globally read-write shared) can also be easily replicated. For example, several matrix transformation algorithms contain at their heart the pattern shown in the following pseudocode:

    barrier();
    for (...) {
        ...
        D1 = D2 + D3;
        ...
    }
    barrier();

where D1 "belongs" to the running thread but D2 and D3 are owned by other threads and stored on other cores; this induces a pattern where the thread must perform remote accesses to load D2 and D3 in every loop iteration. Instead, during time periods when shared data is read many times by several threads and not written, we can make temporary local copies of the data and compute using the local copies:

    barrier();
    // copy D2 and D3 to local L2, L3
    for (...) {
        ...
        D1 = L2 + L3;
        ...
    }
    barrier();

Since the programmer guarantees that these local copies are only read within the barriers, there is no need to invalidate the replicated data afterwards. With these optimizations, we modified a set of SPLASH-2 benchmarks (FFT, LU, OCEAN, RADIX, RAYTRACE, and WATER) in order to reduce the core miss rate under the directoryless architecture. Although we only describe our modifications for LU and WATER here, we have applied the same techniques to the rest of the benchmarks.

LU: In the original version optimized for cache coherence (LU_CONTIGUOUS), which we used as a starting point for optimization, the matrix to be operated on is divided into multiple blocks in such a way that all data points in a given block, which are operated on by the same thread, are allocated contiguously. Each block is also already page-aligned, as shown below:

    [Diagram: the global matrix **a is an array of pointers *p0, *p1, *p2, *p3, ... to Block 0, Block 1, Block 2, Block 3, ...; the blocks are page-aligned.]

Therefore, no data restructuring is required to reduce false sharing. During each computation phase, however, each thread repeatedly reads blocks owned by other threads, but writes only its own block; e.g., in the LU source code snippet

    for (k = 0; k < dimk; k++) {
        for (j = 0; j < dimj; j++) {
            alpha = -b[k + j*strideb];
            for (i = 0; i < dimi; i++)
                c[i + j*stridec] += alpha * a[i + k*stridea];
        }
    }

since the other threads' blocks (a and b) are mapped to different cores than the current thread's own block (c), nearly every access triggers a core miss. Since blocks a and b are read-only data within this function and their contents are not updated by other threads in this scope, we can apply the method of limited local replication. In the modified version, a thread copies the necessary blocks (a and b in the example above) to local variables (which are also page-aligned to avoid false sharing); the computation then only accesses the local copies, eliminating core misses once the replication is done (a sketch of this transformation is shown below). We similarly replicate global read-only data such as the number of threads, the matrix size, and the number of blocks per thread.
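The following is a minimal sketch of what this source-level change looks like; the helper names (copy_block, local_a, local_b), the page size, and the allocation strategy are assumptions for illustration, not the exact modified SPLASH-2 code.

    /* Sketch only: limited local replication for the LU inner kernel.
     * The remote, read-only blocks a and b are copied once into
     * thread-private, page-aligned buffers, and the kernel then computes
     * entirely out of the local copies. */
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    static double *copy_block(const double *src, size_t n)
    {
        size_t bytes = ((n * sizeof(double) + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE;
        double *dst = aligned_alloc(PAGE_SIZE, bytes);  /* page-aligned, thread-local */
        memcpy(dst, src, n * sizeof(double));
        return dst;
    }

    void daxpy_block(double *c, const double *a, const double *b,
                     int dimi, int dimj, int dimk,
                     int stridea, int strideb, int stridec)
    {
        /* Replicate the remote blocks once per phase ...                    */
        double *local_a = copy_block(a, (size_t)(dimk - 1) * stridea + dimi);
        double *local_b = copy_block(b, (size_t)(dimj - 1) * strideb + dimk);

        /* ... so the only non-local accesses left are the bulk copies above. */
        for (int k = 0; k < dimk; k++)
            for (int j = 0; j < dimj; j++) {
                double alpha = -local_b[k + j * strideb];
                for (int i = 0; i < dimi; i++)
                    c[i + j * stridec] += alpha * local_a[i + k * stridea];
            }

        free(local_a);
        free(local_b);
    }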
WATER: In the original code, the main data structure (VAR) is a 1D array of molecules to be simulated, and each thread is assigned a portion of this array to work on:

    [Diagram: *VAR is a contiguous array of molecules MOL 0, MOL 1, MOL 2, MOL 3, ..., with consecutive portions assigned to Thread 0, Thread 1, ...]

The problem with this data structure is that, because all molecules are allocated contiguously, molecules processed by different threads can share the same page, and this false sharing can induce unnecessary memory accesses to remote caches. To address this, we modify the VAR data structure as follows:

    [Diagram: **VAR becomes an array of pointers *p0, *p1, *p2, *p3, ... to MOL 0, MOL 1, MOL 2, MOL 3, ...; the molecules are page-aligned.]

By recasting VAR as an array of pointers, we can page-align all of the molecules, entirely eliminating false sharing among them; this guarantees that the molecules assigned to a particular thread are mapped to the core where the thread executes (a sketch of this restructuring is given at the end of this appendix). In addition, WATER can also be optimized by locally replicating read-only data. For each molecule, the thread computes intermolecular distances to other molecules, which requires read accesses to the molecules owned by other threads:

    CSHIFT() {
        XL[0] = XMA - XMB;
        XL[1] = XMA - XB[0];
        XL[2] = XMA - XB[2];
        XL[3] = XA[0] - XMB;
        XL[4] = XA[2] - XMB;
        XL[5] = XA[0] - XB[0];
        XL[6] = XA[0] - XB[2];
        XL[7] = XA[2] - XB[0];
        ...
    }

Here, XMB and XB are parts of molecules owned by other threads, while XMA, XA, and XL belong to the thread that calls this function. Since all threads are synchronized before and after this step, and the other threads' molecules are not updated, we can safely make a read-only copy in the local memory of the caller thread. Thus, after initially copying XMB and XB to thread-local data, the remainder of the computation induces no further core misses.

                                  FFT    LU    OCEAN   RADIX   RAYTRACE   WATER-NSQ
  Number of total code lines      701    732   3817    662     5461       1192
  Number of changed code lines    21     38    30      27      46         98

Table A.1: The total number of changed code lines

Table A.1 shows that the total number of modified or added lines of code for each benchmark due to this source-level replication is small.¹ These modified benchmarks allow us to extrapolate the benefits that can be obtained by replicating data that need no coherence on the directoryless architecture, and also to compare the performance of the remote-access-only baseline and our hybrid scheme on top of replication support. It is important to note that both directoryless architectures benefit from this replication.

¹ Our count excludes comments and blank lines in the code. Our modifications were strictly source-level and did not alter the algorithms used.
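For concreteness, the following is a minimal sketch of the VAR restructuring described above. The molecule_t layout, NMOL, and the allocation helper are illustrative assumptions; the actual SPLASH-2 WATER code uses its own data structures and allocator.

    /* Sketch only: recasting the contiguous molecule array as an array of
     * page-aligned, per-molecule allocations, so that molecules owned by
     * different threads never share a page. */
    #include <stdlib.h>

    #define PAGE_SIZE 4096
    #define NMOL      512

    typedef struct { double pos[3][3], vel[3][3], force[3][3]; } molecule_t;

    /* Original layout: one contiguous array; neighboring molecules owned by
     * different threads can share a page and cause false sharing. */
    molecule_t *VAR_contig;

    /* Modified layout: an array of pointers, each to a page-aligned molecule. */
    molecule_t **VAR;

    void alloc_molecules(void)
    {
        size_t bytes = ((sizeof(molecule_t) + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE;
        VAR = malloc(NMOL * sizeof(molecule_t *));
        for (int i = 0; i < NMOL; i++)
            VAR[i] = aligned_alloc(PAGE_SIZE, bytes);   /* one page-aligned molecule */
    }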