Directoryless Shared Memory Architecture using
Thread Migration and Remote Access

by

Keun Sup Shim

Bachelor of Science, Electrical Engineering and Computer Science, KAIST, 2006
Master of Science, Electrical Engineering and Computer Science,
Massachusetts Institute of Technology, 2010

Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 2014
Certified by: Srinivas Devadas, Edwin Sibley Webster Professor, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski, Chair, Department Committee on Graduate Students
Directoryless Shared Memory Architecture using
Thread Migration and Remote Access
by
Keun Sup Shim
Submitted to the Department of Electrical Engineering and Computer Science
on May 14, 2014, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Abstract
Chip multiprocessors (CMPs) have become mainstream in recent years, and, for
scalability reasons, high-core-count designs tend towards tiled CMPs with physically
distributed caches. In order to support shared memory, current many-core CMPs
maintain cache coherence using distributed directory protocols, which are extremely
difficult and error-prone to implement and verify. Private caches with directory-based
coherence also provide suboptimal performance when a thread accesses large amounts
of data distributed across the chip: the data must be brought to the core where the
thread is running, incurring delays and energy costs. Under this scenario, migrating a
thread to data instead of the other way around can improve performance.
In this thesis, we propose a directoryless approach where data can be accessed
either via a round-trip remote access protocol or by migrating a thread to where data
resides. While our hardware mechanism for fine-grained thread migration enables
faster migration than previous proposals, its costs still make it crucial to use thread
migrations judiciously for the performance of our proposed architecture. We, therefore,
present an on-line algorithm which decides at the instruction level whether to perform
a remote access or a thread migration. In addition, to further reduce migration costs,
we extend our scheme to support partial context migration by predicting the necessary
thread context. Finally, we provide the ASIC implementation details as well as RTL
simulation results of the Execution Migration Machine (EM²), a 110-core directoryless
shared-memory processor.
Thesis Supervisor: Srinivas Devadas
Title: Edwin Sibley Webster Professor
Acknowledgments
First and foremost, I would like to express my deepest gratitude to my advisor,
Professor Srinivas Devadas, who has offered me full support and has been a tremendous
mentor throughout my Ph.D. years. I feel very fortunate to have had the opportunity
to work with him and learn from him. His energy and insight will continue to inspire
me throughout my career.
I would also like to thank my committee members Professor Arvind and Professor
Daniel Sanchez. They both provided me with invaluable feedback and advice that
helped me to develop my thesis more thoroughly. I am especially grateful to Arvind for
being accessible as a counselor as well, and to Daniel for always being an inspiration
to me for his passion in this field. I truly thank another mentor of mine, Professor Joel
Emer. From Joel, I learned not only about the core concepts of computer architecture
but also about teaching. I feel very privileged for having been a teaching assistant for
his class.
While I appreciate all of my fellow students in the Computation Structures Group
at MIT, I want to express special thanks to Mieszko Lis and Myong Hyon Cho. We
were great collaborators on the EM2 tapeout project, and at the same time, awesome
friends during our doctoral years. It was a great pleasure for me to work with such
talented and fun people.
Getting through my dissertation required more than academic support. Words
cannot express my gratitude and appreciation to my friends from Seoul Science High
School and KAIST at MIT. I am also grateful to my friends at Boston Onnuri Church
for their prayers and encouragement. I would also like to extend my deep gratitude to
Samsung Scholarship for supporting me financially during my doctoral study.
My fiancée Song-Hee deserves my special thanks for her love and care. She has
believed in me more than I did myself and her consistent support has always kept me
energized and made me feel that I am never alone. I cannot thank my parents and
family enough; they have always believed in me, and have been behind me throughout
my entire life. Lastly, I thank God, for offering me so many opportunities in my life
and giving me the strength and wisdom to fully enjoy them.
Contents
1 Introduction
  1.1 Large-Scale Chip Multiprocessors
  1.2 Shared Memory for Large-Scale CMPs
  1.3 Motivation for Fine-grained Thread Migration
  1.4 Motivation for Directoryless Architecture
  1.5 Previous Works on Thread Migration
  1.6 Contributions
2 Directoryless Architecture
  2.1 Introduction
  2.2 Remote Cache Access
  2.3 Hardware-level Thread Migration
  2.4 Performance Overhead of Thread Migration
  2.5 Hybrid Memory Access Framework
3 Thread Migration Prediction
  3.1 Introduction
  3.2 Thread Migration Predictor
    3.2.1 Per-core Thread Migration Predictor
    3.2.2 Detecting Migratory Instructions: WHEN to migrate
    3.2.3 Possible Thrashing in the Migration Predictor
  3.3 Experimental Setup
    3.3.1 Application Benchmarks
    3.3.2 Evaluated Systems
  3.4 Simulation Results
    3.4.1 Performance
  3.5 Chapter Summary
4 Partial Context Migration for General Register File Architecture
  4.1 Introduction
  4.2 Partial Context Thread Migration
    4.2.1 Extending Migration Predictor
    4.2.2 Detection of Useful Registers: WHAT to migrate
    4.2.3 Partial Context Migration Policy
    4.2.4 Misprediction handling
  4.3 Experimental Setup
    4.3.1 Evaluated Systems
  4.4 Simulation Results
    4.4.1 Performance and Network Traffic
    4.4.2 The Effects of Network Parameters
  4.5 Chapter Summary
5 The EM² silicon implementation
  5.1 Introduction
  5.2 EM² Processor
    5.2.1 System architecture
    5.2.2 Tile architecture
    5.2.3 Stack-based core architecture
    5.2.4 Thread migration implementation
    5.2.5 The instruction set
    5.2.6 System configuration and bootstrap
    5.2.7 Virtual memory and OS implications
  5.3 Migration Predictor for EM²
    5.3.1 Stack-based Architecture variant
    5.3.2 Partial Context Migration Policy
    5.3.3 Implementation Details
  5.4 Physical Design of the EM² Processor
    5.4.1 Overview
    5.4.2 Tile-level
    5.4.3 Chip-level
  5.5 Evaluation Methods
    5.5.1 RTL simulation
    5.5.2 Area and power estimates
  5.6 Evaluation
    5.6.1 Performance tradeoff factors
    5.6.2 Benchmark performance
    5.6.3 Area and power costs
    5.6.4 Verification Complexity
  5.7 Chapter Summary
6 Conclusions
  6.1 Thesis contributions
  6.2 Architectural assumptions and their implications
  6.3 Future avenues of research
Bibliography
A Source-level Read-only Data Replication
List of Figures
1-1 Rationale of moving computation instead of data
2-1 Hardware-level thread migration via the on-chip interconnect
2-2 Hybrid memory access framework for our directoryless architecture
3-1 Hybrid memory access architecture with a thread migration predictor on a 5-stage pipeline core
3-2 An example of how instructions (or PCs) that are followed by consecutive accesses to the same home location (i.e., migratory instructions) are detected in the case of the depth threshold θ = 2
3-3 An example of how the decision between remote access and thread migration is made for every memory access
3-4 Parallel K-fold cross-validation using perceptron
3-5 Core miss rate and its breakdown into remote access rate and migration rate
3-6 Parallel completion time normalized to the remote-access-only architecture (NoDirRA)
3-7 Network traffic normalized to the remote-access-only architecture (NoDirRA)
4-1 Hardware-level thread migration with partial context migration support
4-2 A per-core PC-based migration predictor, where each entry contains a {PC, register mask} pair
4-3 An example of how registers being read/written are kept track of and how the information is inserted into the migration predictor when a specific instruction (or PC) is detected as a migratory instruction (the depth threshold θ = 2)
4-4 An example of a partial context thread migration
4-5 Parallel completion time normalized to DirCC
4-6 Network traffic normalized to DirCC
4-7 Breakdown of L1 miss rate
4-8 Core miss rate for directoryless systems
4-9 Network traffic breakdown
4-10 Breakdown of migrated context into used and unused registers
4-11 The effect of network latency and bandwidth on performance and network traffic
5-1 Chip-level layout of the 110-core EM² chip
5-2 EM² tile architecture
5-3 The stack-based processor core diagram of EM²
5-4 Hardware-level thread migration via the on-chip interconnect under EM² (only the main stack is shown for simplicity)
5-5 The two-stage scan chain used to configure the EM² chip
5-6 Integration of a PC-based migration predictor into a stack-based, two-stage pipelined core of EM²
5-7 Decision/learning mechanism of the migration predictor
5-8 EM² tile layout
5-9 Die photo of the 110-core EM² chip
5-10 Thread migration (EM²) vs. remote access (RA)
5-11 Thread migration (EM²) vs. private caching (CC)
5-12 The effect of distance on RA, CC, and EM²
5-13 The evaluation of EM²
5-14 Thread migration statistics under EM²
5-15 Performance and network traffic with different numbers of threads for tbscan under EM²
5-16 N instructions before being evicted from a guest context under EM²
5-17 EM² allows efficient bulk loads from a remote core
5-18 Relative area and leakage power costs of EM² vs. estimates for exact-sharer CC with the directory sized to 100% and 50% of the D$ entries (DC Ultra, IBM 45nm SOI hvt library, 800MHz)
5-19 Bottom-up verification methodology of EM²
List of Tables
3.1 System configurations used
5.1 Interface ports of the migration predictor in EM²
5.2 Power estimates of the EM² tile (reported by Design Compiler)
5.3 A summary of architectural costs that differ in the EM² and CC implementations
A.1 The total number of changed code lines
Chapter 1
Introduction
1.1 Large-Scale Chip Multiprocessors
For the past decades, CMOS scaling has been a driving force of computer performance
improvements. The number of transistors on a single chip has doubled roughly every
18 months (known as Moore's law [44]), and along with Dennard scaling [21], we could
improve the processor performance without hitting the power wall [21]. Starting from
the mid-2000's, however, supply voltage scaling has stopped due to higher leakage,
and power limits have halted the drive to higher core frequencies.
Despite the end of Dennard scaling, however, transistor density has continued to grow [25].
As increasing instruction-level parallelism (ILP) of a single-core processor became less
efficient, computer architects have turned to multicore architectures rather than more
complex uniprocessor architectures to better utilize the available transistors for overall
performance. Since dual-core processors reached the market around 2005, chip multiprocessors (CMPs) with more than one core on a single chip have become common in the commodity and general-purpose processor markets [50,56].
To further improve performance, architects are now resorting to medium- and large-scale multicores. In addition to multiprocessor projects in academia (e.g., RAW [58],
TRIPS [52]), Intel demonstrated its 80-tile TeraFLOPS research chip in 65-nm CMOS
in 2008 [57], followed by the 48-core SCC processor in 45-nm technology, the second
processor in the TeraScale Research program [31]. In 2012, Intel introduced its first
Many Integrated Core (MIC) product, the Intel Xeon Phi family with over 60 cores, to the market [29], and it has recently announced a 72-core x86 Knights Landing CPU [30]. Tilera Corporation shipped its first multiprocessor, TILE64 [7,59], which connects 64 tiles with a 2-D mesh network, in 2007; the company has since
announced TILE-Gx72 which implements 72 power-efficient processor cores and is
suited for many compute and I/O-intensive applications [17]. Adapteva also announced
its 64-core 28-nm microprocessor based on its Epiphany architecture which supports
shared memory and uses a 2D mesh network [48].
As these examples show, processor manufacturers are already able to place tens to hundreds of cores on a single chip, and industry pundits are predicting 1000 or
more cores in a few years [2,8,61].
1.2 Shared Memory for Large-Scale CMPs
For manycore CMPs, each core typically has per-core L1 and L2 caches since power requirements of caches grow quadratically with size; therefore, the only practical option to implement a large on-chip cache is to physically distribute the cache on the chip so that
every core is near some portion of the cache [7,29]. And since conventional bus and
crossbar interconnects no longer scale due to the bandwidth and area limitations [45,46],
these cores are often connected via an on-chip interconnect, forming a tiled architecture
(e.g., Raw [58], TRIPS [52], Tilera [7], Intel TeraFLOPS [57], Adapteva [48]).
How will these manycore chips be programmed? Programming convenience provided by the shared memory abstraction has made it the most popular paradigm for
general-purpose parallel programming. While architectures with restricted memory
models (most notably GPUs) have enjoyed immense success in specific applications
(such as rendering graphics), most programmers prefer a shared memory model [55],
and commercial general-purpose multicores have supported this abstraction in hardware. The main question, then, is how to efficiently provide coherent shared memory
on the scale of hundreds or thousands of cores.
Providing a full shared-memory abstraction requires cache coherence, which is
traditionally implemented by bus-based snooping or a centralized directory for CMPs
with relatively few cores. For large-scale CMPs where bus-based mechanisms fail,
however, snooping and centralized directories are no longer viable, and such many-core
systems commonly provide cache coherence via distributed directory protocols. A
logically central but physically distributed directory coordinates sharing among the
per-core caches, and each core cache must negotiate shared (read-only) or exclusive
(read/write) access to each cache line via a coherence protocol. The use of directories
poses its own challenges, however. Coherence traffic can be significant, which increases
interconnect power, delay, and congestion; the performance of applications can suffer
due to long latencies between directories and requestors, especially for shared read/write
data; finally, directory sizes must equal a significant portion of the combined size
of the per-core caches, as otherwise directory evictions will limit performance [27].
Although some recent works propose more scalable directories or coherence protocols
in terms of area and performance [16,18, 20,24,51], the scalability of directories to a
large number of cores still remains an arguably critical challenge due to the design
complexity, area overheads, etc.
1.3 Motivation for Fine-grained Thread Migration
Under tiled CMPs, each core has its own cache slice and the last-level cache can
be implemented either as private or shared; while the trade-offs between the two
have been actively explored [12,62], many recent works have organized physically
distributed L2 cache slices to form one logically shared L2 cache, naturally leading to
a Non-Uniform Cache Access (NUCA) architecture [4,6,13,15,28,33,36].
And when
large data structures that do not fit in a single cache are shared by multiple threads
or iteratively accessed even by a single thread, the data are typically distributed
across these multiple shared cache slices to minimize expensive off-chip accesses. This
raises the need for a thread to access data mapped at remote caches often with high
spatio-temporal locality, which is prevalent in many applications; for example, a
database request might result in a series of phases, each consisting of many accesses
to contiguous stretches of data.
Figure 1-1: Rationale of moving computation instead of data. (a) Directory-based / RA-only; (b) thread migration.
In a manycore architecture without efficient thread migration, this pattern results
in large amounts of on-chip network traffic.
Each request will typically run in a
separate thread, pinned to a single core throughout its execution. Because this thread
might access data cached in last-level cache slices located in different tiles, the data
must be brought to the core where the thread is running. For example, in a directory-based architecture, the data would be brought to the core's private cache, only to be
replaced when the next phase of the request accesses a different segment of data (see
Figure 1-1a).
If threads can be efficiently migrated across the chip, however, the on-chip data
movement, and with it energy use, can be significantly reduced; instead of transferring data to feed the computing thread, the thread itself can migrate to follow
the data. When applications exhibit data access locality, efficient thread migration
can turn many round-trips to retrieve data into a series of migrations followed by
long stretches of accesses to locally cached data (see Figure 1-1b). And if the thread
context is small compared to the data that would otherwise be transferred, moving
the thread can be a huge win. Migration latency also needs to be kept reasonably
low, and we argue that these requirements call for a simple, efficient hardware-level
implementation of thread migration at the architecture level.
1.4 Motivation for Directoryless Architecture
As described in Section 1.2, private L1 caches need to maintain cache coherence
to support shared memory, which is commonly done via distributed directory-based
protocols in modern large-scale CMPs. One barrier to distributed directory coherence
protocols, however, is that they are extremely difficult to implement and verify [35].
The design of even a simple coherence protocol is not trivial; under a coherence
protocol, the response to a given request is determined by the state of all actors in the
system, transient states due to indirections (e.g., cache line invalidation), and transient
states due to the nondeterminism inherent in the relative timing of events. Since the
state space explodes exponentially as the distributed directories and the number of
cores grow, it is virtually impossible to cover all scenarios during verification either
by simulation or by formal methods [63]. Unfortunately, verifying small subsystems
does not guarantee the correctness of the entire system [3]. In modern CMPs, errors
in cache coherence are one of the leading bug sources in the post-silicon debugging
phase [22].
A straightforward approach to removing directories while maintaining cache coherence is to disallow cache line replication across on-chip caches (even L1 caches)
and use remote word-level access to load and store remotely cached data [23]: in this
scheme, every access to an address cached on a remote core becomes a two-message
round trip. Since only one copy is ever cached, coherence is trivially ensured. Such a
remote-access-only architecture, however, is still susceptible to data access patterns as
shown in Figure 1-1a; each request to non-local data would result in a request-response
pair sent across the on-chip interconnect, incurring significant network traffic and
performance degradation.
As a new design point, therefore, we propose a directoryless architecture which
better exploits data locality by using fine-grained hardware-level thread migration to
complement remote accesses [14,41]. In this approach, accesses to data cached at a
remote core can also cause the thread to migrate to that core and continue execution
there. When several consecutive accesses are made to data at the same core, thread
migration allows those accesses to become local, potentially improving performance
over a remote-access regimen.
Migration costs, however, make it crucial to migrate only when multiple remote
accesses would be replaced to make the cost "worth it." Moreover, since only a few
registers are typically used between the time the thread migrates out and returns,
transfer costs can be reduced by not migrating the unused registers. In this thesis, we
especially focus on how to make judicious decisions on whether to perform a remote
access or to migrate a thread, and how to further reduce thread migration costs by
only migrating the necessary thread context.
1.5 Previous Works on Thread Migration
Migrating computation to accelerate data access is not itself a novel idea. Hector
Garcia-Molina in 1984 introduced the idea of moving processing to data in memory
bound architectures [26], and improving memory access latency via migration has been
proposed using coarse-grained compiler transformations [32]. In recent years migrating
execution context has re-emerged in the context of single-chip multicores. Michaud
showed that execution migration can improve the overall on-chip cache capacity
and selectively migrated sequential programs to improve cache performance [42].
Computation spreading [11] splits thread code into segments and migrates threads
among cores assigned to the segments to improve code locality.
In the area of reliability, Core salvaging [47] allows programs to run on cores with
permanent hardware faults provided they can migrate to access the locally damaged
module at a remote core.
In design-for-power, Thread motion [49] migrates less
demanding threads to cores in a lower voltage/frequency domain to improve the overall
power/performance ratios. More recently, thread migration among heterogeneous
cores has been proposed to improve program bottlenecks (e.g., locks) [34].
Moving thread execution from one processor to another has long been a common
feature in operating systems. The O2 scheduler [9], for example, improves memory
performance in distributed-memory multicores by trying to keep threads near their
data during OS scheduling. This OS-mediated form of migration, however, is far too
slow to make migrating threads for more efficient cache access viable: just moving the
thread takes many hundreds of cycles at best (indeed, OSes generally avoid rebalancing
processor core queues when possible). In addition, commodity processors are simply
not designed to support migration efficiently: while context switch time is a design
consideration, the very coarse granularity of OS-driven thread movement means that
optimizing for fast migration is not.
Similarly, existing descriptions of hardware-level thread migration do not focus
primarily on fast, efficient migrations. Thread Motion [49], for example, uses special
microinstructions to write the thread context to the cache and leverages the underlying
MESI coherence protocol to move threads via the last-level cache. The considerable on-chip traffic and delays that result when the coherence protocol contacts the directory, invalidates sharers, and moves the cache line are acceptable for the 1000-cycle granularity
of the centralized thread balancing logic, but not for the fine-grained migration at the
instruction level which is the focus of this thesis. Similarly, hardware-level migration
among cores via a single, centrally scheduled pool of inactive threads has been described
in a four-core CMP [10]; designed to hide off-chip DRAM access latency, this design
did not focus on migration efficiency, and, together with the round-trips required for
thread-swap requests, the indirections via a per-core spill/fill buffer and the central
inactive pool make it inadequate for the fine-grained migration needed to access remote
caches.
1.6 Contributions
The specific contributions of this dissertation are as follows:
1. A directoryless architecture which supports fine-grained hardware-level thread migration to complement remote accesses (Chapter 2).
Although thread (or process) movement has long been a common OS feature,
the millisecond granularity makes this technique unsuitable for taking advantage
of shorter-lived phenomena like fine-grained memory access locality. Based
23
on our pure hardware implementation of thread migration, we introduce a
directoryless architecture where data mapped on a remote core can be accessed
via a round-trip remote access protocol or by migrating a thread to where data
resides.
2. A novel migration prediction mechanism which decides at instruction
granularity whether to perform a remote access or a thread migration
(Chapter 3). Due to high migration costs, it is crucial to use thread migrations
judiciously under the proposed directoryless architecture. We, therefore, present
an on-line algorithm which decides at the instruction level whether to perform a
remote access or a thread migration.
3. Partial context thread migration to reduce migration costs (Chapter 4). We observe that not all the architectural registers are used while a
thread is running on the migrated core, and therefore, always moving the entire
thread context upon thread migrations is wasteful. In order to further cut down
the cost of thread migration, we extend our prediction scheme to support partial
context migration, a novel thread migration approach that only migrates the
necessary part of the architectural state.
4. The 110-core Execution Migration Machine (EM²), the silicon implementation to support hardware-level thread migration in a 45nm ASIC (Chapter 5). We provide the salient physical implementation details of our silicon prototype of the proposed architecture built as a 110-core CMP, which occupies 100 mm² in 45nm ASIC technology. The EM² chip adopts the
stack-based core architecture which is best suited for partial context migration,
and it also implements the stack-variant migration predictor. We also present
detailed evaluation results of EM² using RTL-level simulation of several
benchmarks on a full 110-core chip.
Chapter 6 concludes the thesis with a summary of the major findings and suggestions
for future avenues of research.
Relation to other publications.
This thesis extends and summarizes prior publications by the author and others. The deadlock-free fine-grained thread migration
protocol was first presented in [14], and a directoryless architecture using this thread
migration framework with remote access (cf. Chapter 2) was introduced in [40,41].
While these papers do not address deciding between migrations and remote accesses
for each memory access, Chapter 3 subsumes the description of a migration predictor
presented in [54].
The work is extended in Chapter 4 to support partial context
migration by learning and predicting the necessary thread context. In terms of the
EM² chip, the tapeout process was a collaboration with Mieszko Lis and Myong Hyon Cho, and the evaluation results of the RTL simulation in Chapter 5 were joint work with Mieszko Lis; some of this content, therefore, has appeared or will also appear in their theses. The physical implementation details of EM² and our chip design
experience can also be found in [53].
Chapter 2
Directoryless Architecture
2.1 Introduction
For scalability reasons, large-scale CMPs (> 16 cores) tend towards a tiled architecture
where arrays of replicated tiles are connected over an on-chip interconnect [7,52,58].
Each tile contains a processor with its own L1 cache, a slice of the L2 cache, and a
router that connects to the on-chip network. To maximize effective on-chip cache
capacity and reduce off-chip access rates, physically distributed L2 cache slices form
one large logically shared cache, known as Non-Uniform Cache Access (NUCA)
architecture [13,28,36].
Under this Shared L2 organization of NUCA designs, the
address space is divided among the cores in such a way that each address is assigned
to a unique home core where the data corresponding to the address can be cached
at the L2 level. At the L1 level, on the other hand, data can be replicated across any requesting core since current CMPs use private L1 caches. Coherence at the L1
level is maintained via a coherence protocol and distributed directories, which are
commonly co-located with the shared L2 slice at the home core.
To completely obviate the need for complex protocols and directories, a directoryless
architecture extends the shared organization to L1 caches: a cache line may only reside in its home core even at the L1 level [23]. Because only one copy is ever cached,
cache coherence is trivially ensured.
To read and write data cached in a remote
core, the directoryless architectures proposed and built so far use a remote access
mechanism wherein a request is sent to the home core and the resulting data (or
acknowledgement) is sent back to the requesting core.
In what follows, we describe this remote access protocol, as well as a protocol
based on hardware-level thread migration where instead of making a round-trip remote
access the thread simply moves to the core where the data resides. We then present a
framework that combines both.
2.2 Remote Cache Access
Under the remote-access framework of directoryless designs [23, 36], all non-local
memory accesses cause a request to be transmitted over the interconnect network, the
access to be performed in the remote core, and the data (for loads) or acknowledgement
(for writes) to be sent back to the requesting core: when a core C executes a memory
access for address A, it must
1. find the home core H for A (e.g., by consulting a mapping table or masking
some address bits);
2. if H = C (a core hit),
(a) forward the request for A to the cache hierarchy (possibly resulting in a
DRAM access);
3. if H ≠ C (a core miss),
(a) send a remote access request for address A to core H;
(b) when the request arrives at H, forward it to H's cache hierarchy (possibly
resulting in a DRAM access);
(c) when the cache access completes, send a response back to C;
(d) once the response arrives at C, continue execution.
Note that, unlike a private cache organization where a coherence protocol (e.g., a directory-based protocol) takes advantage of spatial and temporal locality by making a copy of the block containing the data in the local cache, this protocol incurs a round-trip access for every remote word. Each load or store access to an address cached in a different core incurs a word-granularity round-trip message to the core allowed to cache the address, and the retrieved data is never cached locally (the combination of word-level access and no local caching ensures correct memory semantics).
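As a concrete illustration of this round-trip protocol, the sketch below models the core-hit/core-miss paths in plain C++, assuming a simple static mapping from addresses to home cores (masking/shifting address bits); the struct and function names (Core, home_core, serve_remote_read, and so on) are illustrative stand-ins rather than the actual hardware interfaces.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr int NUM_CORES = 64;
constexpr uint64_t PAGE_BITS = 12;   // 4 KB pages, as in Table 3.1

// One possible home-core mapping: take the page number modulo the core count.
int home_core(uint64_t addr) { return static_cast<int>((addr >> PAGE_BITS) % NUM_CORES); }

struct Core {
    int id;
    std::unordered_map<uint64_t, uint8_t> slice;  // stands in for this core's cache/DRAM slice

    // Executed at the home core on behalf of a requester (steps 3b/3c above).
    uint8_t serve_remote_read(uint64_t addr) { return slice[addr]; }
    void serve_remote_write(uint64_t addr, uint8_t v) { slice[addr] = v; }
};

// A load issued by core c: local access on a core hit, otherwise a word-granularity
// round trip to the home core; the returned word is never cached at c.
uint8_t load(Core& c, std::vector<Core>& cores, uint64_t addr) {
    int h = home_core(addr);
    if (h == c.id) return c.serve_remote_read(addr);   // core hit: local cache hierarchy
    return cores[h].serve_remote_read(addr);           // core miss: request/response pair
}
```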
2.3 Hardware-level Thread Migration
We now describe fine-grained, hardware-level thread migration, which we use to better
exploit data locality for our directoryless architecture. This mechanism brings the
execution to the locus of the data instead of the other way around: when a thread
needs access to an address cached on another core, the hardware efficiently migrates
the thread's execution context to the core where the data is (or is allowed to be)
cached.
If a thread is already executing at the destination core, it must be evicted and
moved to a core where it can continue running. To reduce the need for evictions and
amortize migration latency, cores duplicate the architectural context (register file, etc.)
and allow a core to multiplex execution among two (or more) concurrent threads. To
prevent deadlock, one context is marked as the native context and the other as the
guest context: a core's native context may only hold the thread that started execution
there (called the thread's native core), and evicted threads must return to their native
cores to ensure deadlock freedom [14].
Briefly, when a core C running thread T executes a memory access for address A,
it must
1. find the home core H for A (e.g., by consulting a mapping table or masking the
appropriate bits);
2. if H = C (a core hit),
(a) forward the request for A to the local cache hierarchy (possibly resulting in a DRAM access);
3. if H ≠ C (a core miss),
(a) interrupt the execution of the thread on C (as for a precise exception),
(b) unload the execution context (microarchitectural state) and convert it to a
network packet (as shown in Figure 2-1), and send it to H via the on-chip
interconnect:
i. if H is the native core for T, place it in the native context slot;
ii. otherwise:
A. if the guest slot on H contains another thread T', evict T' and
migrate it to its native core N';
B. move T into the guest slot for H;
(c) resume execution of T on H, requesting A from its cache hierarchy (and
potentially accessing backing DRAM or the next-level cache).
When an exception occurs on a remote core, the thread migrates to its native core to
handle it.
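To make the native/guest eviction rule concrete, here is a minimal software model of the two execution contexts per core and of where a migrating thread lands; it assumes, as the protocol guarantees, that a thread's native slot is free whenever that thread is away, and the names are hypothetical rather than taken from the RTL.

```cpp
#include <optional>
#include <vector>

struct Thread {
    int id;
    int native_core;   // the core whose native slot this thread owns
    // ... register file, PC, etc. would travel with the thread
};

struct CoreCtx {
    std::optional<Thread> native;  // only the thread that started here may occupy this
    std::optional<Thread> guest;   // any visiting thread may occupy this
};

// Move thread t to core dst; an evicted guest is sent back to its native core.
void migrate_to(std::vector<CoreCtx>& cores, Thread t, int dst) {
    CoreCtx& d = cores[dst];
    if (t.native_core == dst) {    // arriving home: use the native slot
        d.native = t;
        return;
    }
    if (d.guest) {                 // guest slot busy: evict its occupant first
        Thread evicted = *d.guest;
        d.guest.reset();
        // Always terminates: the eviction target is the evicted thread's native slot.
        migrate_to(cores, evicted, evicted.native_core);
    }
    d.guest = t;                   // take the guest slot
}
```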
Although the migration framework requires hardware changes to the baseline
directoryless design (since the core must be designed to support efficient migration),
it migrates threads directly over the interconnect, which is much faster than other
thread migration approaches (such as OS-level migration or Thread Motion [49], which
leverage the existing cache coherence protocol to migrate threads).
2.4 Performance Overhead of Thread Migration
Since the thread context is directly sent across the network, the performance overhead
of thread migration is directly affected by the context size. The relevant architectural
state that must be migrated in a 64-bit x86 processor amounts to about 3.1 Kbits
(sixteen 64-bit general-purpose registers, sixteen 128-bit floating-point registers and
special purpose registers), which is what we use in this thesis. The context size will
vary depending on the architecture; in the TILEPro64 [7], for example, it amounts
to about 2.2 Kbits (64 32-bit registers and a few special registers). This introduces a serialization latency, since the full context needs to be loaded into (and unloaded from) the network: with a 128-bit flit network and a 3.1-Kbit context size, this becomes ⌈packet size / flit size⌉ = 26 flits, incurring a serialization overhead of 26 cycles. With a 64-bit register file with two read ports and two write ports, one 128-bit flit can be read/written in one cycle, and thus we assume no additional serialization latency due to a lack of ports from/to the thread context.
Figure 2-1: Hardware-level thread migration via the on-chip interconnect
Another overhead is the pipeline insertion latency. Since a memory address is
computed at the end of the execute stage, if a thread ends up migrating to another
core and re-executes from the beginning of the pipeline, it needs to refill the pipeline.
In case of a typical five-stage pipeline core, this results in an overhead of three cycles.
To make fair performance comparisons, all these migration overheads are included
as part of execution time for architectures that use thread migrations, and their values
are specified in Table 3.1.
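The 26-cycle serialization figure follows from dividing the context size by the flit width; the snippet below reproduces the arithmetic, treating the size of the special-purpose registers as an assumed 256 bits so that the totals line up with the numbers quoted above.

```cpp
#include <cstdio>

int main() {
    // Register-file breakdown given in the text; the special-purpose register
    // size is an assumption chosen to be consistent with the 26-flit figure.
    const int gpr_bits     = 16 * 64;    // sixteen 64-bit general-purpose registers
    const int fp_bits      = 16 * 128;   // sixteen 128-bit floating-point registers
    const int special_bits = 256;        // assumed size of special-purpose registers
    const int flit_bits    = 128;        // network flit width (Table 3.1)

    int context_bits = gpr_bits + fp_bits + special_bits;    // on the order of 3 Kbits
    int flits = (context_bits + flit_bits - 1) / flit_bits;  // ceiling division
    std::printf("context: %d bits -> %d flits -> %d serialization cycles\n",
                context_bits, flits, flits);
    std::printf("pipeline insertion latency: 3 cycles (5-stage pipeline refill)\n");
    return 0;
}
```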
2.5 Hybrid Memory Access Framework
We now propose a hybrid architecture by combining the two mechanisms described:
each core-miss memory access may either perform the access via a remote access as in
Section 2.2 or migrate the current execution thread as in Section 2.3. This architecture
is illustrated in Figure 2-2.
Figure 2-2: Hybrid memory access framework for our directoryless architecture
For each access to memory cached on a remote core, a decision algorithm determines
whether the access should migrate to the target core or execute a remote access.
Because this decision must be taken on every access, it must be implementable as
efficient hardware. In our design, an automatic predictor decides between migration
and remote access on a per-instruction granularity. It is worthwhile to mention that
we allow replication for instructions since they are read-only; threads need not perform
a remote access nor migrate to fetch instructions.
We describe the design of this predictor in the next chapter.
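The per-access choice sketched in Figure 2-2 boils down to a three-way decision; a compact sketch is shown below, where MigrationPredictor is a placeholder for the PC-based predictor developed in the next chapter (the stub here never migrates) and all names are illustrative.

```cpp
#include <cstdint>

enum class Action { LocalAccess, RemoteAccess, Migrate };

// Placeholder for the per-core, PC-indexed predictor of Chapter 3; a hit means
// "this PC tends to start a run of accesses to one remote core, so migrate".
struct MigrationPredictor {
    bool hit(uint64_t /*pc*/) const { return false; }   // stub: remote-access only
};

// Decision taken for every memory access issued by a thread running on core cur;
// home is the unique core allowed to cache the address. Instruction fetches never
// take this path, since instructions are replicated read-only.
Action decide(const MigrationPredictor& pred, int cur, int home, uint64_t pc) {
    if (home == cur)  return Action::LocalAccess;    // core hit
    if (pred.hit(pc)) return Action::Migrate;        // move the thread to the home core
    return Action::RemoteAccess;                     // word-granularity round trip
}
```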
Chapter 3
Thread Migration Prediction
3.1 Introduction
Under the remote-access-only architecture, every core-miss memory access results in
a round-trip remote request and its reply (data word for load and acknowledgement
for store).
Therefore, migrating a thread can be beneficial when several memory
accesses are made to the same core: while the first access incurs the migration costs,
the remaining accesses become local and are much faster than remote accesses. Since
thread migration costs exceed the cost required by remote-access-only designs on a
per-access basis due to a large thread context size, the goal of the thread migration
predictor is to judiciously decide whether or not a thread should migrate:
since
migration outperforms remote accesses only for multiple contiguous memory accesses
to the same location, our migration predictor focuses on detecting those.
3.2 Thread Migration Predictor
3.2.1 Per-core Thread Migration Predictor
Since the migration/remote-access decision must be made on every memory access, the
decision mechanism must be implementable as efficient hardware. To this end, we will
describe a per-core migration predictor: a PC-indexed, direct-mapped data structure where each entry simply stores a PC. The predictor is based on the observation that sequences of consecutive memory accesses to the same home core are highly correlated with the program flow, and that these patterns are fairly consistent and repetitive across program execution. Our baseline configuration uses 128 entries; with a 64-bit PC, this amounts to about 1 KB total per core.
Figure 3-1: Hybrid memory access architecture with a thread migration predictor on a 5-stage pipeline core.
The migration predictor can be consulted in parallel with the lookup of the home
core for the given address. If the home core is not the core where the thread is currently
running (a core miss), the predictor must decide between a remote access and a thread
migration: if the PC hits in the predictor, it instructs a thread to migrate; if it misses,
a remote access is performed.
Figure 3-1 shows the integration of the migration predictor in a hybrid memory
access architecture on a 5-stage pipeline core. The architectural context (RegFile2
and PC2) is duplicated to support deadlock-free thread migration (cf. Section 2.3);
the shaded module is the migration predictor component.
In the next section, we describe how a certain instruction (or PC) can be detected
as "migratory" and thus inserted into the migration predictor.
3.2.2 Detecting Migratory Instructions: WHEN to migrate
At a high level, the prediction mechanism operates as follows:
1. when a program first starts execution, it runs as the baseline directoryless
architecture which only uses remote accesses;
2. as it continues execution, it monitors the home core information for each memory
access, and
3. remembers the first instruction of every multiple access sequence to the same
home core;
4. depending on the length of the sequence, the instruction address is either inserted
into the migration predictor (a migratory instruction) or is evicted from the
predictor (a remote-access instruction);
5. the next time a thread executes the instruction, it migrates to the home core if
it is a migratory instruction (a "hit" in the predictor), and performs a remote
access if it is a remote-access instruction (a "miss" in the predictor).
The detection of migratory instructions which trigger thread migrations can be
easily done by tracking how many consecutive accesses to the same remote core have
been made, and if this count exceeds a threshold, inserting the PC into the predictor
to trigger migration. If it does not exceed the threshold, the instruction is classified as
a remote-access instruction, which is the default state. Each thread tracks (1) Home,
which maintains the home location (core ID) for the current requested memory address,
(2) Depth, which indicates how many times so far a thread has contiguously accessed
the current home location (i.e., the Home field), and (3) Start PC, which tracks the
PC of the very first instruction among memory sequences that accessed the home
location that is stored in the Home field. We separately define the depth threshold θ, which indicates the depth at which we determine an instruction to be migratory.
The detection mechanism is as follows (a software sketch follows the list below): when a thread T executes a memory instruction for address A whose PC = P, it must
1. find the home core H for A (e.g., by consulting a mapping table or masking the appropriate bits);
2. if Home = H (i.e., a memory access to the same home core as that of the previous memory access),
(a) if Depth < θ, increment Depth by one;
3. if Home ≠ H (i.e., a new sequence starts with a new home core),
(a) if Depth = θ, Start PC is considered a migratory instruction and thus inserted into the migration predictor;
(b) if Depth < θ, Start PC is considered a remote-access instruction;
(c) reset the entry (i.e., Home = H, Start PC = P, Depth = 1).
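A software sketch of this bookkeeping is given below: the per-thread {Home, Depth, Start PC} entry, the depth threshold θ, and a 128-entry, PC-indexed, direct-mapped predictor as in the baseline configuration. It is an illustrative model of the learning rule, not the hardware implementation, and the index hashing and field widths are assumptions.

```cpp
#include <array>
#include <cstdint>

constexpr int THETA = 2;        // depth threshold (theta); 3 is used in the evaluation
constexpr int ENTRIES = 128;    // direct-mapped predictor entries (about 1 KB of PCs)

struct MigrationPredictor {
    std::array<uint64_t, ENTRIES> pc_tag{};                  // 0 = invalid entry
    static std::size_t index(uint64_t pc) { return (pc >> 2) % ENTRIES; }
    bool hit(uint64_t pc) const { return pc_tag[index(pc)] == pc; }  // hit => migrate
    void insert(uint64_t pc) { pc_tag[index(pc)] = pc; }             // mark migratory
    void evict(uint64_t pc) { if (hit(pc)) pc_tag[index(pc)] = 0; }  // back to remote access
};

struct DetectorState {          // per-thread {Home, Depth, Start PC}
    int home = -1;
    int depth = 0;
    uint64_t start_pc = 0;
};

// Invoked for every memory instruction with its PC and the home core of its address.
void on_memory_access(DetectorState& st, MigrationPredictor& pred,
                      uint64_t pc, int home_core) {
    if (home_core == st.home) {                 // same home core as the previous access
        if (st.depth < THETA) st.depth++;
    } else {                                    // a new sequence starts at a new home core
        if (st.depth >= THETA) pred.insert(st.start_pc);   // Start PC was migratory
        else                   pred.evict(st.start_pc);    // Start PC stays remote-access
        st.home = home_core;
        st.depth = 1;
        st.start_pc = pc;
    }
}
```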
Memory Instruction (PC, Home) | Present State (Home, Depth, Start PC) | Next State (Home, Depth, Start PC) | Action
I1: PC1, home A | -, -, - | A, 1, PC1 | Reset the entry for a new sequence starting from PC1
I2: PC2, home B | A, 1, PC1 | B, 1, PC2 | Reset the entry for a new sequence starting from PC2 (evict PC1 from the predictor, if it exists)
I3: PC3, home C | B, 1, PC2 | C, 1, PC3 | Reset the entry for a new sequence starting from PC3 (evict PC2 from the predictor, if it exists)
I4: PC4, home C | C, 1, PC3 | C, 2, PC3 | Increment the depth by one
I5: PC5, home C | C, 2, PC3 | C, 2, PC3 | Do nothing (threshold already reached)
I6: PC6, home C | C, 2, PC3 | C, 2, PC3 | Do nothing (threshold already reached)
I7: PC7, home A | C, 2, PC3 | A, 1, PC7 | Insert PC3 into the migration predictor; reset the entry for a new sequence starting from PC7

Figure 3-2: An example of how instructions (or PCs) that are followed by consecutive accesses to the same home location (i.e., migratory instructions) are detected in the case of the depth threshold θ = 2.
Figure 3-2 shows an example of the detection mechanism when θ = 2. Setting θ = 2 means that a thread will perform remote accesses for "one-off" accesses and will migrate for multiple accesses (≥ 2) to the same home core. Suppose a thread executes a sequence of memory instructions, I1 ~ I7 (non-memory instructions are ignored in this example because they do not change the entry content nor affect the mechanism). The PC of each instruction from I1 to I7 is PC1, PC2, ... PC7, respectively, and the home core for the memory address that each instruction accesses is specified next to each PC. When I1 is first executed, the entry {Home, Depth, Start PC} will hold the value of {A, 1, PC1}. Then, when I2 is executed, since the home core of I2 (B) is different from Home, which maintains the home core of the previous instruction I1 (A), the entry is reset with the information of I2. Since the Depth to core A has not reached the depth threshold, PC1 is considered a remote-access instruction (default). The same thing happens for I3, setting PC2 as a remote-access instruction. (Since all instructions are initially considered remote-access instructions, setting an instruction as a remote-access instruction has no effect if it has not previously been classified as a migratory instruction. If the instruction was migratory, i.e., its PC is in the predictor, it reverts back to the remote-access mode by invalidating the corresponding entry of the migration predictor.) Now when I4 is executed, it accesses the same home core C and thus only the Depth field needs to be updated (incremented by one). For I5 and I6, which keep accessing the same home core C, we need not update the entry because the depth has already reached the threshold θ, which we assumed to be 2. Lastly, when I7 is executed, since the Depth to core C has reached the threshold, PC3 in the Start PC field, which represents the first instruction (I3) that accessed this home core C, is classified as a migratory instruction and thus is added to the migration predictor. Finally, the predictor resets the entry and starts a new memory sequence starting from PC7 for the home core A.
When an instruction (or PC) that has been added to the migration predictor is
encountered again, the thread will directly migrate instead of sending a remote request
and waiting for a reply. Suppose the example sequence I1 ~ I7 we used in Figure 3-2 is repeated as a loop (i.e., I1, I2, ... I7, I1, ...) by a thread originating at core A. Under a standard, remote-access-only architecture where the thread will never leave its native core A, every loop will incur five round-trip remote accesses; among the seven instructions from I1 to I7, only two of them (I1 and I7) access core A and thus result in core hits. Under our migration predictor with θ = 2, on the other hand, PC3 and PC7 will be added to the migration predictor and thus the thread will now migrate at I3 and I7 in the steady state. As shown in Figure 3-3, every loop incurs two migrations, turning I4, I5, and I6 into core hits (i.e., local accesses) at core C: overall, 4 out of 7 memory accesses complete locally. The benefit of migrating a thread becomes even more significant with a longer sequence of successive memory accesses to the same non-native core (core C in this example).
Figure 3-3: An example of how the decision between remote access and thread migration is made for every memory access. (a) I2 is served via a remote access since its PC, PC2, is not in the migration predictor. (b) The thread migrates when it encounters I3 since it hits in the migration predictor. (c) By migrating the thread to core C, three successive accesses to core C (I4, I5, and I6) now turn into local memory accesses. (d) On I7, the thread migrates back to core A; overall, two migrations and one remote access are incurred for a single loop.
3.2.3 Possible Thrashing in the Migration Predictor
Since we use a fixed size data structure for our migration predictor, collisions between
different migratory PCs can result in suboptimal performance. While we have chosen
a size that results in good performance, some designs may need larger (or smaller)
predictors.
Another subtlety is that mispredictions may occur if memory access
patterns for the same PC differ across two threads (one native thread and one guest
thread) running on the same core simultaneously because they share the same per-core
predictor and may override each other's decisions. Should this interference become
significant, it can be resolved by implementing two predictors instead of one per
core: one for the native context and the other for the guest context.
In our set of benchmarks, we rarely observed performance degradation due to
these collisions and mispredictions with a fairly small predictor (about 1KB per core)
shared by both native and guest context. This is because each worker thread executes
very similar instructions (although on different data) and thus, the detected migratory
instructions for threads are very similar. While such application behavior may keep
the predictor simple, however, our migration predictor is not restricted to any specific
applications and can be extended if necessary as described above. It is important
to note that even if a rare misprediction occurs due to either predictor eviction or
interference between threads, the memory access will still be carried out correctly, and
the functional correctness of the program is still maintained.
3.3 Experimental Setup
We use Pin [5] and Graphite [43] to model the proposed hybrid architecture that
supports both remote-access and thread migration.
Pin enables runtime binary
instrumentation of parallel programs; Graphite implements a tile-based multicore,
memory subsystem, and network, modeling performance and ensuring functional
correctness. The default system parameters are summarized in Table 3.1.
Parameter | Settings
Cores | 64 in-order, 5-stage pipeline, single-issue cores, 2-way fine-grain multithreading
L1/L2 cache per core | 32/128 KB, 2/4-way set associative, 64B block
Electrical network | 2D mesh, XY routing, 2 cycles per hop (+ contention), 128b flits
Migration overhead | 3.1 Kbits full execution context size; full context load/unload latency: ⌈context size / flit size⌉ = 26 cycles; pipeline insertion latency = 3 cycles
Data placement | First-touch after initialization, 4 KB page size
Table 3.1: System configurations used
Experiments were performed using Graphite's model of an electrical mesh network
with XY routing with 128-bit flits. Since modern NoC routers are pipelined [19], and
2- or even 1-cycle per hop router latencies [38] have been demonstrated, we model a
2-cycle per-hop router delay; we also account for the pipeline latencies associated with
loading/unloading packets onto the network. In addition to the fixed per-hop latency,
we model contention delays using a probabilistic model as in [37].
For data placement, we use the first-touch after initialization policy which allocates
the page to the core that first accesses it after parallel processing has started. This
allows private pages to be mapped locally to the core that uses them, and avoids all
the pages being mapped to the same core where the main data structure is initialized
before the actual parallel region starts.
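First-touch after initialization can be pictured as a lazily filled page-to-core table that is only consulted once the parallel region has begun; the sketch below is a simplified illustration (the real policy lives in the simulator's memory mapping, and the names here are made up).

```cpp
#include <cstdint>
#include <unordered_map>

constexpr uint64_t PAGE_SIZE = 4096;   // 4 KB pages, as in Table 3.1

// First-touch-after-initialization placement: the table is filled only after the
// parallel region starts, so pages touched during sequential initialization do
// not all end up on the initializing core.
struct FirstTouchPlacement {
    std::unordered_map<uint64_t, int> page_to_core;

    int home_core(uint64_t addr, int accessing_core) {
        uint64_t page = addr / PAGE_SIZE;
        auto it = page_to_core.try_emplace(page, accessing_core).first;
        return it->second;   // first toucher wins; later accesses see the same home
    }
};
```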
3.3.1 Application Benchmarks
Our experiments use a parallel perceptron cross-validation (prcn+cv) benchmark and a set of Splash-2 [60] benchmarks with the recommended input set for the number of cores used: fft, lu-contiguous, ocean-contiguous, radix, raytrace, and water-nsq. (Some Splash-2 benchmarks were not included due to simulation issues. Unlike the other Splash-2 benchmarks, radix originally filled an input array with random numbers, which is not a primary part of the radix-sort algorithm, in the parallel region; we therefore moved this initialization prior to spawning worker threads so that the parallel region solely performs the actual sorting.)
Figure 3-4: Parallel K-fold cross-validation using perceptron. Each thread runs a separate experiment, which sequentially trains the model with (K-1) data chunks and tests with the last chunk; the total data is spread across the L2 cache slices (data chunk i is mapped to core i).
Parallel cross-validation (prcn+cv) is a popular machine learning technique for
optimizing model accuracy. In the k-fold cross-validation, as illustrated in Figure 3-4,
data samples are split into k disjoint chunks and used to run k independent leave-one-out experiments. Each thread runs a separate experiment, which sequentially trains the
model with k - 1 data chunks (training data) and tests with the last chunk (test data).
The results of k experiments are used either to better estimate the final prediction
accuracy of the algorithm being trained, or, when used with different parameter values,
to pick the parameter that results in the best accuracy. Since the experiments are
computationally independent, they naturally map to multiple threads. Indeed, for
sequential machine learning algorithms, such as stochastic gradient descent, this is the
only practical form of parallelization because the model used in each experiment is
necessarily sequential. The chunks are typically spread across the shared cache shards,
and each experiment repeatedly accesses a given chunk before moving on to the next
one.
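The structure of the benchmark can be sketched as follows (a simplified, illustrative C++ version rather than the actual benchmark code; the sample layout and the perceptron update rule shown here are standard but assumed): each of the k experiments runs in its own thread and sequentially sweeps the k-1 training chunks.

#include <thread>
#include <vector>

struct Sample { std::vector<float> x; int label; };          // label in {-1, +1}
using Chunk = std::vector<Sample>;

// One leave-one-out experiment: train sequentially on k-1 chunks, test on the held-out one.
static float run_experiment(const std::vector<Chunk>& chunks, size_t held_out) {
    std::vector<float> w(chunks[0][0].x.size(), 0.0f);        // perceptron weights
    for (size_t c = 0; c < chunks.size(); ++c) {
        if (c == held_out) continue;
        for (const Sample& s : chunks[c]) {                   // sequential training pass
            float dot = 0.0f;
            for (size_t i = 0; i < w.size(); ++i) dot += w[i] * s.x[i];
            if (dot * s.label <= 0)                            // misclassified: update weights
                for (size_t i = 0; i < w.size(); ++i) w[i] += s.label * s.x[i];
        }
    }
    size_t correct = 0;
    for (const Sample& s : chunks[held_out]) {                 // test on the held-out chunk
        float dot = 0.0f;
        for (size_t i = 0; i < w.size(); ++i) dot += w[i] * s.x[i];
        if (dot * s.label > 0) ++correct;
    }
    return chunks[held_out].empty() ? 0.0f
                                    : float(correct) / chunks[held_out].size();
}

// The k experiments are computationally independent, so each runs in its own thread.
void parallel_cross_validation(const std::vector<Chunk>& chunks, std::vector<float>& acc) {
    acc.assign(chunks.size(), 0.0f);
    std::vector<std::thread> workers;
    for (size_t k = 0; k < chunks.size(); ++k)
        workers.emplace_back([&, k] { acc[k] = run_experiment(chunks, k); });
    for (auto& t : workers) t.join();
}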
Our set of Splash-2 benchmarks is slightly modified from the original versions:
while both the remote-access-only baseline and our proposed architecture do not
allow replication for any kind of data at the hardware level, read-only data can
actually be replicated without breaking cache coherence even without directories and
a coherence protocol. We, therefore, applied source-level read-only data replication to
these benchmarks; more details on this can be found in Appendix A. Our optimizations
were limited to rearranging and replicating some data structures (i.e., only tens of lines
of code changed) and did not alter the algorithm used; automating this replication
is outside of the scope of this work. It is important to note that both the remote-access-only baseline and our hybrid architecture benefit almost equally from these
optimizations.
Each application was run to completion; for each simulation run, we measured the
core miss rate, the number of core-miss memory accesses divided by the total number
of memory accesses.
Since each core-miss memory access must be handled either
by remote access or by thread migration, the core miss rate can further be broken
down into remote access rate and migration rate. For the baseline remote-access-only
architecture, the core miss rate equals the remote access rate (i.e., no migrations);
for our hybrid design, the core miss rate is the sum of the remote access rate and
the migration rate. For performance, we measured the parallel completion time (the
longest completion time in the parallel region). Migration overheads (cf. Chapter 2.4)
for our hybrid architecture are taken into account.
3.3.2
Evaluated Systems
Since our primary focus in this chapter is to improve the capability of exploiting data
locality at remote cores by using thread migrations judiciously, we compare our hybrid
directoryless architecture against the remote-access-only directoryless architecture.⁴
We refer to the directoryless, remote-access-only architecture as NoDirRA and the
hybrid architecture with our migration predictor as NoDirPred-Full. The suffix
-Full means that the entire thread context is always migrated upon thread migration.

⁴The performance comparison against a conventional directory-based scheme is provided in
Chapter 4.
3.4 Simulation Results
3.4.1 Performance
We first compare the core miss rates for a directoryless system without and with
thread migration: the results are shown in Figure 3-5. The depth threshold θ is set
to 3 for our migration predictor, which aims to perform remote accesses for memory
sequences with one or two accesses and migrations for those with three or more accesses to
the same core. Although we have evaluated our system with different values of θ, we
consistently use θ = 3 here since increasing θ only makes our hybrid design converge
to the remote-access-only design and does not provide any further insight.
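For reference, the decision and detection logic of Chapter 3.2.2 that produces these results can be summarized by the following illustrative C++ sketch (the names and data structures are ours, not the hardware's): a core-miss access migrates only if its PC has previously started a sequence of at least θ consecutive accesses to the same home core.

#include <cstdint>
#include <unordered_set>

// Sketch of the per-thread decision/detection logic (illustrative, not the RTL).
struct MigrationPredictor {
    std::unordered_set<uint64_t> migratory_pcs;  // PCs predicted to start migratory sequences
    int      theta = 3;                          // depth threshold θ

    // Per-thread tracking state (cf. Chapter 3.2.2).
    int      home  = -1;                         // home core of the current access sequence
    int      depth = 0;                          // consecutive accesses to 'home' so far
    uint64_t start_pc = 0;                       // PC of the first access in the sequence

    // Called on every memory access; returns true if the thread should migrate
    // (on a core miss), false if it should use a round-trip remote access.
    bool on_memory_access(uint64_t pc, int home_core) {
        if (home_core == home) {
            if (++depth == theta) migratory_pcs.insert(start_pc);  // learn this sequence
        } else {
            if (depth < theta) migratory_pcs.erase(start_pc);      // sequence was too short
            home = home_core; start_pc = pc; depth = 1;            // start a new sequence
        }
        return migratory_pcs.count(pc) != 0;                       // decision for this access
    }
};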
While 21% of total memory accesses result in core misses for the remote-access-only design on average, the directoryless architecture with our migration predictor
results in a core miss rate of 6.7%, a 68% improvement in data locality. Figure 3-5
also shows the fraction of core miss accesses handled by remote accesses and thread
migrations in our design. We observe that a large fraction of remote accesses are
successfully replaced with a much smaller number of migrations. For example, prcn+cv
shows the best scenario, where it originally incurred an 87% remote access rate under
a remote-access-only architecture, which dropped to 0.8% with a small number of
migrations. Across all benchmarks, the average migration rate is only 1% resulting in
68% fewer core misses overall.
Figure 3-5: Core miss rate and its breakdown into remote access rate and migration
rate (remote accesses under NoDirRA vs. remote accesses and migrations under
NoDirPred-Full)
This improvement in data locality translates into better performance for our directoryless
architecture with thread migration, as shown in Figure 3-6. Our proposed system
shows 25% better performance on average (geometric mean) across all benchmarks;
when excluding prcn+cv for reference, the average performance improvement is 6%.
However, due to the relatively large thread context size, the network traffic overhead
of thread migrations can be significant.
Figure 3-7 shows on-chip network traffic,
which is measured as the number of flits sent times the number of hops traveled.
Except for prcn+cv, we observe that NoDirPred-Full actually incurs more network
traffic than NoDirRA (even with small migration rates). Therefore, in order to make
the architecture more viable, we believe that reducing the migration costs is critical,
which is addressed in the next chapter.
Figure 3-6: Parallel completion time normalized to the remote-access-only architecture
(NoDirRA)
3.5
Chapter Summary
In this chapter, we presented an on-line, PC-based thread migration predictor for
our directoryless architecture that uses thread migration and remote access. Our
results show that migrating threads for sequences of multiple accesses to the same
core can improve data locality in directoryless designs, and with our predictor, it
can result in better performance compared to the baseline design which only relies
on remote accesses. However, we observed that the high network traffic overhead of
thread migration remains since the entire thread context is always being migrated.
Therefore, we need to further reduce the migration costs, which is achieved by partial
context thread migration described in the next chapter.
Figure 3-7: Network traffic normalized to the remote-access-only architecture
(NoDirRA)
Chapter 4
Partial Context Migration for
General Register File Architecture
4.1
Introduction
We can further reduce the cost of thread migrations by sending only a part of the
register file when a thread migrates. This is based on the observation that only some
of the registers are usually used between the time the thread migrates out of its
native core and the time it migrates back; therefore, if this subset of registers can
be accurately predicted when the thread migrates, migration costs can be cut down
significantly. In this chapter, we present partial context migration; its goal
is to predict which registers will be read/written while the thread is away from its
native core, and to migrate only those. Implementing such partial context migration
requires the core architecture to support 1) partial loading and unloading of the thread
context, 2) the capability to predict which part of the context will be used at the
migrated core, and 3) a mechanism to handle misprediction. These are discussed in
detail below.
Figure 4-1: Hardware-level thread migration with partial context migration support
(a register mask controls which registers the packetizer unloads onto, and the
depacketizer loads from, the on-chip interconnect network)
4.2 Partial Context Thread Migration
4.2.1 Extending Migration Predictor
Figure 4-1 shows a hardware architecture to support partial context migration: during a
thread migration, a packetizer (or a depacketizer) decodes a register mask, a bit-vector
where each bit represents whether or not the corresponding register is to be migrated;
registers whose corresponding bits are set in the register mask will be unloaded onto
the network (or loaded from the network). With the deadlock-free thread migration
framework described in [14], even though a thread migrates away from its native core,
the native-core register file remains intact in its native context since it is not used by
any other guest threads; this allows us to carry out only the registers read "on the
trip" and bring back only the registers written while away.
We now extend our migration predictor; we observe that not only sequences of
consecutive memory accesses to the same home core but also register usage patterns
within those sequences are highly correlated with the program (instruction) flow. Our
baseline configuration uses a 128-entry predictor, each entry of which consists of a 64-bit
PC and a 32-bit register mask, which amounts to about 1.5 KB in total.¹ Our extended
migration predictor is shown in Figure 4-2.

¹An N-bit mask is required for an architecture with N general register file registers, where each
bit indicates whether or not the corresponding register needs to be sent in case of migration. In this
thesis, we use N = 32, which accounts for 16 64-bit registers (rdi, rsi, rbp, rsp, rbx, rdx, rcx, rax and
r8 to r15) and 16 128-bit XMM registers (xmm0 to xmm15).

Figure 4-2: A per-core PC-based migration predictor, where each entry contains a
{PC, register mask} pair (registers whose mask bits are 1 are sent; the mask is only
used when a thread migrates from its native core)
Our original predictor decides between a remote access and a thread migration
upon a core miss. With the partial context migration support, moreover, if the
thread is migrating from its native core to another core, only the registers whose
corresponding bits in the register mask are set will be transferred.² This register mask
field is only used when a thread leaves its native core; it is not used when it migrates from
a non-native core to another non-native core, or when it migrates back to its native
core from outside.

²Special purpose registers such as rip, rflags and mxcsr are always transferred and thus are not
included in the 32-bit register mask.
In the next section, we describe how the predictor stores the used-registers information for each migratory instruction.
4.2.2
Detection of Useful Registers: WHAT to migrate
We now extend our migration predictor to support partial context migrations by
predicting which registers need to be sent for each migration and sending only those.
This requires each thread to keep track of which registers have been read/written
within a sequence of memory instructions accessing the same home core; this can
be easily implemented on top of the mechanism we described in Chapter 3.2.2. In
addition to (1) Home, (2) Depth, and (3) Start PC, each thread now also tracks (4) Used
Registers, a 32-bit vector where each bit indicates whether the corresponding
register has been used or not. Every instruction (both memory and non-memory)
updates this Used Registers field by setting the bit when the corresponding register is
read or written.

Figure 4-3: An example of how registers being read/written are tracked and how the
information is inserted into the migration predictor when a specific instruction (or
PC) is detected as a migratory instruction (the depth threshold θ = 2).
It may seem that registers which are only written and not read while a thread is
away from its native core may not have to be transferred because they will be written
anyway. This is true when the ISA does not support partial registers. In our design,
however, we assume registers can be partially read or written, and thus, we treat these
registers as a necessary part of the migration context to simplify managing the case of
writing into a partial register.
When the PC is detected as a migratory instruction and thus inserted into the
migration predictor (cf. Chapter 3.2.2), the Used Registers field is inserted together
with Start PC into the Useful Register Mask in the migration predictor (see Figure 4-2).
Figure 4-3 shows an example of the detection mechanism when θ = 2. Suppose
a thread executes a sequence of instructions, I1-I5. I1, I3 and I5 are memory
instructions, I2 and I4 are non-memory instructions, and rn denotes the nth register.
When I1 is first executed, the entry {Home, Depth, Start PC, Used Registers} will
hold the value of {C, 1, PC1, r1}. Then, when I2, a non-memory instruction using r2
and r3, is executed, the Used Registers bit-vector is updated to set the bits for r2 and
r3. When I3 is executed, it accesses the same home core C and thus the Depth field is
incremented by one; r2 is already included in the used register bit-vector, so its value
does not change. I4 simply adds r4 to the register bit-vector and lastly, when I5 is
executed, since the Depth to core C has reached the threshold, PC1 in the Start PC
field is added to the migration predictor with the register mask bits. The migration
predictor will now contain a {PC, Useful Register Mask} pair, which allows a thread
to predict the useful registers from the time the thread migrates out from its native
core until it migrates back to its native core.
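The same bookkeeping can be expressed compactly in C++ (an illustrative sketch rather than the hardware implementation; here the {Start PC, Used Registers} pair is inserted when the access sequence ends, matching the walkthrough above): replaying I1-I5 with θ = 2 leaves {PC1, {r1, r2, r3, r4}} in the predictor.

#include <bitset>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sketch of per-thread tracking of (Home, Depth, Start PC, Used Registers) and of
// inserting a {PC, useful register mask} pair into the migration predictor.
struct RegisterTracker {
    std::unordered_map<uint64_t, std::bitset<32>> predictor;  // Start PC -> useful register mask
    int theta = 2;

    int home = -1;
    int depth = 0;
    uint64_t start_pc = 0;
    std::bitset<32> used;

    // Every instruction (memory or not) marks the registers it reads or writes.
    void on_instruction(const std::vector<int>& regs) {
        for (int r : regs) used.set(r);
    }

    // Memory instructions additionally update the access-sequence tracking.
    void on_memory_instruction(uint64_t pc, int home_core, const std::vector<int>& regs) {
        if (home_core != home) {
            if (depth >= theta) predictor[start_pc] = used;  // sequence long enough: learn it
            home = home_core; start_pc = pc; depth = 0; used.reset();
        }
        ++depth;
        on_instruction(regs);
    }
};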
4.2.3
Partial Context Migration Policy
The partial context migration policy is as follows (each case is illustrated in Figure 4-4): when a thread T executes a memory instruction whose PC hits in the migration
predictor and thus needs to migrate,
1. if T is migrating from its native core to a non-native core, it takes the registers
specified in the Useful Register Mask of the migration predictor (cf. Figure 4-4a);
2. if T is migrating from a non-native core to another non-native core, it takes all
the registers that T brought when T first migrated out from its native core (cf.
Figure 4-4b);
3. if T is migrating back to its native core from a non-native core, it takes only
the registers that are written while T was outside from its native core (cf.
Figure 4-4c).
4. Special purpose registers required for the thread execution (e.g., rip, rflags and
mxcsr for a 64-bit x86 architecture) are always transferred.
In order to implement these policies, a thread carries around two 32-bit masks:
V-mask and W-mask. V-mask identifies the registers that the thread may access while
outside of its native core (looked up in the predictor when the thread first migrated
out from its native core). The W-mask keeps track of the registers that have been written
while outside the native core, and is used to implement policy (3). Since a register file
remains intact in the native context, a thread returning to its native core needs to
carry only the registers that have been modified. During migrations, these two masks
(64 bits in total) and {Home, Depth, Start PC, Used Registers} must be transferred
together with the 3.1 Kbit context (cf. Chapter 2.4). With 64 cores (6 bits for the
home core ID), a maximum depth threshold of 8 (3 bits), a 64-bit Start PC and a
32-bit used register mask, a total of 169 bits have to be transferred in addition to the
context.

Figure 4-4: An example of partial context thread migration. (a) Suppose a thread
originated at core A (i.e., core A is its native core). When it migrates due to a hit in
the migration predictor, it only takes the registers specified in the useful register mask
field of the predictor. (b) Since only r1, r2 and r3 have been brought from its native
core, the V-mask will only contain these three registers; for a migration from core C
to core D, both non-native cores, only the registers in the V-mask are migrated. (c)
Suppose the register r1 has been written while the thread is outside of its native core;
the W-mask contains r1. When the thread migrates back to its native core, it only
brings back the register r1 in the W-mask. (d) While the thread is running at a
non-native core (core D), a register miss can happen if it needs access to a register
that is not in the V-mask; if this happens, the thread migrates back to its native core
with only the written registers.
It is important to note that unlike the decision on whether to perform a remote
access or a thread migration, the useful register information in the migration predictor
is only consulted by a thread when at its native core; this is because the native context
is the only place where all the register values are maintained for the thread, and once
it leaves the native core, the thread cannot use any registers other than the ones it
initially brought from its native core (i.e., registers in V-mask).
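The three migration cases and the roles of the V-mask and W-mask can be summarized in the following illustrative C++ sketch (function and field names are ours; special-purpose registers are handled outside the 32-bit mask as noted above):

#include <cstdint>

// Illustrative sketch of the partial context migration policy (not the hardware itself).
struct ThreadMasks {
    uint32_t v_mask = 0;  // registers brought along when the thread left its native core
    uint32_t w_mask = 0;  // registers written while away from the native core
};

// Returns the 32-bit register mask to transfer for a migration of thread 't'.
uint32_t registers_to_migrate(bool at_native_core, bool going_to_native_core,
                              uint32_t predicted_useful_mask, ThreadMasks& t) {
    if (at_native_core && !going_to_native_core) {
        // Case 1: native -> non-native; consult the predictor's useful register mask.
        t.v_mask = predicted_useful_mask;
        t.w_mask = 0;
        return t.v_mask;
    }
    if (!at_native_core && !going_to_native_core) {
        // Case 2: non-native -> non-native; carry everything originally brought along.
        return t.v_mask;
    }
    // Case 3: non-native -> native; only modified registers need to travel back,
    // since the native register file still holds the rest.
    return t.w_mask;
    // (Special-purpose registers such as rip, rflags and mxcsr are always transferred
    //  and are not part of this 32-bit mask.)
}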
4.2.4
Misprediction handling
This makes it possible for a thread, while outside its native core, to encounter an instruction
which requires a specific register rn that has not been brought from its native
core (i.e., rn ∉ V-mask); we call this a register miss. A register miss can happen,
for example, when the program flow changes due to branches and conditional jumps
resulting in a different sequence of instructions being executed. When a register miss
occurs, the thread stops its execution (just like when a core miss occurs), and returns
to its native core (cf. Figure 4-4d).
Our migration predictor tries to minimize migrations caused by register misses;
therefore, we update the useful register mask in the migration predictor by adding
the register that caused the register miss when the thread migrates back. With this
learning mechanism, the useful register mask for a particular PC, PC1 , will eventually
converge to a superset of registers that are used after the thread migrates at PC1
until it migrates back to its native core. We show the overhead of register misses
and how much the network traffic can be reduced using partial context migrations in
Chapter 4.4.
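The register-miss path and its learning step can likewise be sketched as follows (illustrative C++ only): the thread returns home carrying its W-mask, and the useful register mask of the PC that caused the original migration grows to include the missed register.

#include <cstdint>
#include <unordered_map>

// Sketch of register-miss handling and predictor learning (illustrative only).
struct PartialContextPredictor {
    std::unordered_map<uint64_t, uint32_t> useful_mask;  // migratory PC -> register mask

    // Called when a thread, away from its native core, needs register 'reg'
    // that is not in its V-mask. Returns the mask to carry back home (the W-mask).
    uint32_t on_register_miss(uint64_t migrating_pc, int reg, uint32_t w_mask) {
        useful_mask[migrating_pc] |= (1u << reg);  // learn: send this register next time
        return w_mask;                             // migrate back with written registers only
    }
};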
4.3
Experimental Setup
We use Graphite [43] to model the proposed directoryless architecture that supports
both remote-access and partial context thread migration. The same system parameters
as the previous chapter are used (cf. Table 3.1).
4.3.1 Evaluated Systems
We compare our hybrid directoryless architecture with migration predictor (NoDirPred)
against the remote-access-only directoryless baseline (NoDirRA). To see how well
the predictor itself works, we also compare with a simple DISTANCE decision scheme
(NoDirDist) previously proposed by [41]: the intuition here is that over short distances
the round-trip remote-access overhead is low, so threads migrate only if the distance
to the home core exceeds some threshold d. We use d = 6, the average hop count
for an 8x8 mesh, and transfer the full context during migrations. We also present
the result for a directory-based cache-coherence architecture (DirCC) to provide a
sense of how directoryless designs perform compared to conventional designs. DirCC
uses the MSI protocol with distributed full-map directories on a private-L1, shared-L2 configuration. This makes for an apples-to-apples comparison between directory
schemes and directoryless designs because using the shared-L2 configuration with
the same data placement policy results in negligible differences in off-chip access
rates across all the systems we evaluate; the main performance gap stems from the
performance of on-chip cache accesses.
4.4 Simulation Results
4.4.1 Performance and Network Traffic
We compare the overall performance among DirCC, NoDirRA, NoDirDist, and
NoDirPred; the results are shown in Figure 4-5. For NoDirPred, the depth threshold
θ is set to 3. Although we have evaluated our system with different values of θ, we
consistently use θ = 3 here since increasing θ only makes our hybrid design converge
to the remote-access-only design and does not provide any further insight. When
compared to DirCC, NoDirRA performs worse by 59% on average, while our hybrid
architecture (NoDirPred) performs worse by 18% on average. NoDirDist performs
the worst, indicating that migration decisions must be made judiciously. Since I-cache
content is not transferred during migrations, NoDirPred shows 7% more I-cache misses
Figure 4-5: Parallel completion time normalized to DirCC (DirCC, NoDirRA,
NoDirDist, and NoDirPred)
than NoDirRA on average; I-cache miss rates, however, are still very low (mostly <
0.1%) and have negligible effect on performance. We also compare on-chip network
traffic in each system, measured as the number of flits sent times the number of hops
traveled. Figure 4-6 shows that NoDirPred reduces network traffic by 24% on average
compared to DirCC, and by 55% when compared to NoDirRA; while not shown in
the figure, the network traffic for NoDirDist is prohibitive, 6x more traffic on average
compared to DirCC.
Although the average performance of NoDirPred is less than that of DirCC, it is
important to note that most of the benchmarks we are using were originally developed
with directory coherence in mind. Parallel cross-validation with the perceptron learning
algorithm (prcn+cv) is an example where directory-based coherence does not work
well; the computation requires each thread to traverse through a dataset spread across
the cores, resulting in many accesses to remote caches and high network overhead for
DirCC. As a result, NoDirPred outperforms DirCC by 34% with 42x less traffic for
prcn+cv, demonstrating that such overhead can be eliminated by migrating threads
to the data.
Figure 4-6: Network traffic normalized to DirCC

To better understand the overall performance, we measured L1 cache miss rates
for DirCC and NoDirPred; the results are shown in Figure 4-7. Since cache lines are
not replicated across L1 caches in the directoryless design (NoDirPred), the effective
L1 cache capacity increases, always resulting in lower L1 miss rates than DirCC; more
importantly, while all L1 misses under NoDirPred are forwarded to local L2 caches, a
large fraction of L1 misses for DirCC result in memory requests to remote L2 caches,
a major factor in performance degradation and network traffic for the directory-based
architecture.

Figure 4-7: Breakdown of L1 miss rate (all L1 misses are local for NoDirPred; DirCC
L1 misses split into those served by the local L2 and those served by remote L2 slices)
On the other hand, directoryless designs can suffer when the core miss rate is
high, i.e., when frequently accessing data cached in remote cores; the core miss rate
of DirCC is always zero. Figure 4-8 shows that on average, 21% of total memory
accesses result in core misses for NoDirRA, which drops to only 6.6% for NoDirPred.
While not shown, this improvement is achieved with the average migration rate of
1%, indicating that the predictor works well. Raytrace and water are examples where
NoDirPred suffers in terms of both performance and network traffic due to high core
miss rates.

Figure 4-8: Core miss rate for directoryless systems (NoDirRA, NoDirDist, and
NoDirPred)
In order to track how traffic is reduced by partial context migration, we compare
our design with the full context migration variant, which always sends the full thread
context during migrations (NoDirPred-Full). The results are shown in Figure 4-9:
NoDirPred reduces out migration traffic (migrations to non-native cores) by 52%
and back migration traffic (migrations back to native cores) by 68% compared to
NoDirPred-Full. The reduction in out migration traffic is achieved by our predictor
(the useful register field) and the reduction in back migration traffic is achieved by
the W-mask, which keeps track of the written registers. While using partial context
migration occasionally induces unnecessary migrations due to register misses, we
observe almost no overhead from this because our predictor learns from each miss by
adding the missing register to the useful register mask for the appropriate PC. With
this union mechanism, however, the register mask will only grow and never shrink
back; this makes our context prediction conservative and thus, some of the registers
that are migrated may not be actually used. Across all benchmarks, around 75% of
migrated registers are actually used on average (see Figure 4-10), showing that our
predictor is reasonably efficient.

Figure 4-9: Network traffic breakdown into remote accesses, out migrations, back
migrations, and register-miss migrations (NoDirRA, NoDirPred-Full, and NoDirPred)

Figure 4-10: Breakdown of migrated context into used and unused registers
4.4.2
The Effects of Network Parameters
We further demonstrate that the relative performance and network traffic of our
hybrid architecture (NoDirPred) are maintained over different network parameters.
Figure 4-11 shows that NoDirPred outperforms NoDirRA by 29% with 3-cycle per-hop
latency (originally, 25% with 2-cycle per-hop latency); this is because the round-trip
nature of remote accesses suffers more from increased per-hop latency. With a 64-bit
flit network instead of 128-bit, on the other hand, the network traffic reduction rate of
NoDirPred over NoDirRA decreases from 55% to 43%; this is because a large fraction
of remote access messages (i.e., those that do not carry a data word) fit into 64 bits,
and do not need additional flits to make up for the halved bandwidth. Performance
improvements also drop slightly, but not significantly.

Figure 4-11: The effect of network latency and bandwidth on performance and network
traffic
4.5
Chapter Summary
In this chapter, we have extended our PC-based migration predictor to support
partial context migration in order to reduce the size of the migrated context. With
significantly reduced migration costs, our evaluation results show that the migration
predictor exploits data locality to maximum advantage:
it performs better than the remote-access-only baseline by 25% on average, while
incurring 55% less network traffic thanks to partial context migrations.
We have further demonstrated that, for certain applications, a directoryless architecture with fine-grained partial-context thread migration can outperform or match
directory-based coherence with less on-chip traffic. While the performance of our
architecture is 18% worse than the directory-based cache-coherent architecture on
average, the network traffic is reduced by 24%; given that the architecture requires no
directories or complicated coherence protocols, we believe that our approach points
to promising avenues for simplified hardware shared memory support on many-core
CMPs.
Chapter 5
The EM 2 silicon implementation
5.1
Introduction
In previous chapters, we have presented a hardware mechanism for fine-grained thread
migration, and used the technique to complement remote access for a directoryless
architecture with our migration predictor; the predictor not only decides whether
to migrate a thread or perform a remote access, but also supports partial context
migration. To confirm that such an architecture is indeed realizable in actual hardware,
we implemented and fabricated a proof-of-concept chip that demonstrates the feasibility
of our approach, namely the Execution Migration Machine (EM 2 ).
The actual
implementation process also allows us to explore in detail the microarchitecture of our
proposed schemes.
This chapter discusses the design decisions and implementation details of the EM2
chip, a 110-core shared-memory processor that supports thread migration and remote
access. Evaluation results from RTL-level simulation of several benchmarks
are also provided.
Figure 5-1: Chip-level layout of the 110-core EM2 chip (10 mm x 10 mm die with two
off-chip memory interfaces)
5.2 EM2 Processor
5.2.1 System architecture
The physical chip comprises approximately 357,000,000 transistors on a 10 mm x 10 mm
die in 45nm ASIC technology, using a 476-pin wirebond package. The EM2 chip
consists of 110 homogeneous tiles placed on a 10 x 11 grid. In lieu of a DRAM interface,
our test chip exposes the two networks that carry off-chip memory traffic via a
programmable rate-matching interface; this, in turn, connects to a maximum of 16GB
of DRAM via a controller implemented in an FPGA. The EM2 chip layout is shown
in Figure 5-1.
Tiles are connected in a 2D mesh geometry by six independent on-chip networks:
two networks carry migration/eviction traffic, another two carry remote-access requests/responses, and a further two carry external DRAM requests/responses; in each case,
two networks are required to ensure deadlock-free operation [14]. The six channels are
Figure 5-2: EM2 tile architecture (855 µm x 917 µm tile; six 64-bit links per tile)
implemented as six physically separate on-chip networks, each with its own router in
every tile. Each network carries 64-bit flits using wormhole flow control and dimension order routing. The routers are ingress-buffered, and are capable of single-cycle
forwarding under congestion-free conditions, a technique feasible even in multi-GHz
designs [38].
While using a single network with six virtual channels would have utilized available
link bandwidth more efficiently and made inter-tile routing simpler, it would have
exponentially increased the crossbar size and significantly complicated the allocation
logic (the number of inputs grows proportionally to the number of virtual channels
and the number of outputs to the total bisection bandwidth between adjacent routers).
Moreover, using six identical networks allowed us to verify in isolation the operation
of a single network, and then safely replicate it six times to form the interconnect,
significantly reducing the total verification effort.
5.2.2 Tile architecture
Figure 5-2 shows an EM2 tile; each tile contains six Network-on-Chip (NoC) routers as
described in Chapter 5.2.1, a processor core, a migration predictor, and a single level
(L1) of instruction and data caches: an 8KB read-only instruction cache and a 32KB
data cache per tile, resulting in a total of 4.4MB on-chip cache capacity. The caches
are capable of single-cycle read hits and two-cycle write hits. The entire memory
address space of 16GB is divided into 110 non-overlapping regions as required by the
EM 2 shared memory semantics, and each tile's data cache may only cache the address
range assigned to it. In addition to serving local and remote requests for the address
range assigned to it, the data cache block also provides an interface to remote caches
via the remote-access protocol. Memory is word-addressable and there is no virtual
address translation; cache lines are 32 bytes.
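For illustration only, a home lookup for this kind of static address partitioning might look like the sketch below (C++); the contiguous, equal-sized regions and the clamping of the last region are assumptions, since the chip's exact address-to-tile mapping is not detailed here.

#include <cstdint>

// Illustrative home-tile lookup for a statically partitioned address space
// (16 GB divided into 110 non-overlapping regions, one per tile).
constexpr uint64_t kMemoryBytes = 16ull << 30;  // 16 GB
constexpr int      kNumTiles    = 110;
constexpr uint64_t kRegionBytes = kMemoryBytes / kNumTiles;

// Only the data cache of home_tile(addr) may cache 'addr'; any other tile must
// either send a remote-access request to it or migrate the thread there.
inline int home_tile(uint64_t addr) {
    int t = static_cast<int>(addr / kRegionBytes);
    return t < kNumTiles ? t : kNumTiles - 1;   // last region absorbs the rounding remainder
}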
The details of the EM2 processor core architecture are described in the next section.
5.2.3
Stack-based core architecture
Figure 5-3: The stack-based processor core diagram of EM2 (each of the native and
guest contexts has a PC, a main stack and an auxiliary stack, sharing the instruction
and data caches)
To simplify the implementation of partial context migration and maximally reduce
on-chip bit movement, EM2 cores implement a custom 32-bit stack-based architecture
(cf. Figure 5-3). Since the likelihood of the context being necessary increases toward
the top of the stack due to the nature of a stack-based ISA, a migrating thread can take
along only as much of its context as is required by only migrating the top part of the
stack. Furthermore, the amount of the context to transfer can be easily controlled
with a single parameter, which is the depth of the stack to migrate (i.e., the number
of stack entries from the top of the stack).
To ensure deadlock-free thread migration in all cases, the core contains two thread
contexts, called a native context and a guest context (both contexts share the same I$
port, which means that they do not execute concurrently). Each thread has a unique
native context where no other thread can execute; when a thread wishes to execute
in another core, it must execute in that core's guest context [14]. Functionally, the
two contexts are nearly identical; the differences consist of the data cache interface in
the native context that supports stack spills and refills (in a guest context stacks are
not backed by memory, and stack underflow/overflow causes the thread to migrate
back to its native context where the stacks can be spilled or refilled), and the thread
eviction logic and associated link to the on-chip eviction network in the guest context.
To reduce CPU area, the EM2 core contains neither a floating point unit nor an
integer divider circuit. The core is a two-stage pipeline with a top-of-stack bypass
that allows an instruction's arguments to be sourced from the previous instruction's
ALU outputs. Each context has two stacks, main and auxiliary: most instructions
take their arguments from the top entries of the main stack and leave their result
on the top of the main stack, while the auxiliary stack can only be used to copy or
move data from/to the top of the main stack; special instructions rearrange the top
four elements of the main stack. The sizes of the main stack and the auxiliary stack
are 16 and 8 entries. On stack overflow or underflow, the core automatically spills or
refills the stack from the data cache; in a sense, the main and auxiliary stacks serve as
caches for conceptually infinite stacks stored in memory.
5.2.4
Thread migration implementation
Whenever the thread migrates out of its native core, it has the option of transmitting
only the part of its thread context that it expects to use at the destination core.
Figure 5-4: Hardware-level thread migration via the on-chip interconnect under EM2
(context unload at the source core, travel over H hops, and context load at the
destination core). Only the main stack is shown for simplicity.
In each packet, the first (head) flit encodes the destination and the packet length, as well as
the thread's ID, the program counter, and the number of main stack and
auxiliary stack elements in the body flits that follow.
consists of one head flit and one body flit which contains two 32-bit stack entries.
Migrations from a guest context must transmit all of the occupied stack entries, since
guest context stacks are not backed by memory.
Figure 5-4 illustrates how the processor cores and the on-chip network efficiently
support fast instruction-granularity thread migration.
When the core fetches an
instruction that triggers a migration (for example, because of a memory access to
data cached in a remote tile), the migration destination is computed and, if there is
no network congestion, the migration packet's head flit is serialized into the on-chip
router buffers in the same clock cycle. While the head flit transits the on-chip network,
the remaining flits are serialized into the router buffer in a pipelined fashion. Once the
packet has arrived at the destination NoC router and the destination core context is
free, it is directly deserialized; the next instruction is fetched as soon as the program
counter is available and the instruction cache access proceeds in parallel with the
deserialization of the migrated stack entries. In our implementation, assuming a thread
migrates H hops with B body flits, the overall thread migration latency amounts to
1 + H + 1 + B cycles from the time a migrating instruction is fetched at the source
core to when the thread begins execution at the destination core. In the EM2 chip, H
varies from 1 (nearest-neighbor core) to 19 (the maximum number of hops for a 10x11
mesh), and B varies from 1 (two main stack entries and no auxiliary stack entries) to
12 (sixteen main stack entries and eight auxiliary stack entries, two entries per flit);
this results in a very low migration latency, ranging from a minimum of 4 cycles
to a maximum of 33 cycles (assuming no network congestion).¹
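The latency model quoted above can be captured in a few lines of C++ (a sketch of the analytical model, not the RTL; the packing of two stack entries per body flit follows the description above):

#include <algorithm>
#include <cstdio>

// Sketch of the migration-latency model (1 + H + 1 + B cycles), assuming no
// network congestion. Two stack entries are packed into each 64-bit body flit.
int migration_latency_cycles(int hops, int main_entries, int aux_entries) {
    int body_flits = std::max(1, (main_entries + aux_entries + 1) / 2);
    return 1 /* head flit serialization */ + hops + 1 /* context load */ + body_flits;
}

int main() {
    // Nearest neighbor with a minimal context: 4 cycles; full context across 19 hops: 33 cycles.
    std::printf("%d %d\n", migration_latency_cycles(1, 2, 0),
                migration_latency_cycles(19, 16, 8));
}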
While a native context is reserved for its native thread and therefore is always free
when this thread arrives, a guest context might be executing another thread when a
migration packet arrives. In this case, the newly arrived thread is buffered until the
currently executing thread has had a chance to complete some (configurable) number
of instructions; then, the active guest thread is evicted to make room for the newly
arrived one. During the eviction process the entire active context is serialized just as
in the case of a migration (the eviction network is used to avoid deadlock), and once
the last flit of the eviction packet has entered the network the newly arrived thread is
unloaded from the network and begins execution.
5.2.5
The instruction set
We briefly describe the instruction set architecture (ISA) of the EM 2 core below.
Stacks.
Each core context contains a main stack (16 entries) and an auxiliary stack
(8 entries), and instructions operate on the top of those stacks much like RISC instructions
operate on registers. On stack overflow or underflow, the core automatically accesses
the data cache to spill or refill the core stacks. Stacks naturally and elegantly support
partial context migration, since the topmost entries which are migrated as a partial
context are exactly the ones that the next few instructions will use.
¹Although it is possible to migrate with no main stack entries, this is unusual, because most
instructions require one or two words on the stack to perform computations. The minimum latency
in this case is still 4 cycles, because execution must wait for the I$ fetch to complete anyway.
Computation and stack manipulation. The core implements the usual arithmetic, logical, and comparison instructions on 32-bit integers, with the exception of
hardware divide. Those instructions consume one or two elements from the main stack
and push their results back there. Instructions in the push class place immediates on
the stack, and variants that place the thread ID, core ID, or the PC on top of the
stack help effect inter-thread synchronization.
To make stack management easier, the top four entries of the main stack can be
rearranged using a set of stack manipulation instructions. Access to deeper stack
entries can be achieved via instructions that move or copy the top of the main stack
onto the auxiliary stack and back.
Control flow and explicit migration.
Flow control is effected via the usual
conditional branches (which are relative) and unconditional jumps and calls (relative
or absolute). Threads can be manually migrated using the migrate instruction, and
efficiently spawned on remote cores via the newthread instruction.
Memory instructions.
Word-granularity loads and stores come in EM (migrating)
and RA (remote access) versions, as well as in a generic version which defers the
decision to the migration predictor. The EM and generic versions encode the stack
depths that should be migrated, which can be used instead of the predictor-determined
depths. Providing manual and automatic versions gives the user both convenience
and maximum control.
Similarly, stores come in acked as well as fire-and-forget variants. Together with
per-instruction memory fences, the ack variant provides sequential consistency while
the fire-and-forget version may be used if a higher-level protocol obviates the need for
per-word guarantees. Load-reserve and store-conditional instructions provide atomic
read-modify-write access, and come in EM and RA flavors.
Figure 5-5: The two-stage scan chain used to configure the EM2 chip (each link
consists of a lockup register and a config register driven by two separate scan clocks)
5.2.6
System configuration and bootstrap
To initialize the EM2 chip to a known state during power-up, we chose to use a
scan-chain mechanism. Unlike the commonly employed bootloader strategy, in which
one of the cores is hard-coded with a location of a program that configures the rest of
the system, successful configuration via the scan-chain approach does not rely on any
cores to be operating correctly: the only points that must be verified are (a) that bits
correctly advance through the scan chain, and (b) that the contents of the scan chain
are correctly picked up by the relevant core configuration settings. In fact, other than
a small state machine to ensure that caches are invalidated at reset, the EM 2 chip
does not have any reset-specific logic that would have to be separately verified.
The main disadvantages here are (a) that the EM 2 chip is not self-initializing,
i.e., that system configuration must be managed external to the chip, and (b) that
configuration at the slow rate permitted by the scan chain will take a number of
minutes. For an academic chip destined to be used exclusively in a lab environment,
however, those disadvantages are relatively minor, and offloading complexity
from the chip itself onto test equipment is well worth it.
The scan chain itself was designed specifically to avoid hold-time violations in the
physical design phase. To this end, the chain uses two sets of registers and is driven by
two clocks: the first clock copies the current value of the scan input (i.e., the previous
link in the chain) into a "lockup" register, while the second moves the lockup register
value to a "config" register, which can be read by the core logic (see Figure 5-5). By
suitably interleaving the two scan clocks, we ensure that the source of any signal is the
output of a flip-flop that is not being written at the same clock edge, thus avoiding
hold-time issues. While this approach sacrificed some area (since the scan registers are
duplicated), it removed a significant source of hold-time violations during the full-chip
assembly phase of physical layout, likely saving us time and frustration.
5.2.7
Virtual memory and OS implications
Although our test chip follows the accelerator model and does not support virtual
memory and does not require a full operating system, fine-grained migration can
be equally well implemented in a full-fledged CPU architecture. Virtual addressing
at first sight potentially delays the local-vs-remote decision by one cycle (since the
physical address must be resolved via a TLB lookup), but in a distributed shared cache
architecture this lookup is already required to resolve which tile caches the data (if the
L1 cache is virtually addressed, this lookup can proceed in parallel with the L1 access
as usual). Program-initiated OS system calls and device accesses occasionally require
that the thread remain pinned to a core for some number of instructions; these can be
accomplished by migrating the thread to its native context on the relevant instruction.²
OS-initiated tasks such as process scheduling and load rebalancing typically take place
at a granularity of many milliseconds, and can be supported by requiring each thread
to return to its native core every so often.

²In fact, our ASIC implementation uses this approach to allow the program to access various
statistics tables.
5.3 Migration Predictor for EM2
5.3.1 Stack-based Architecture variant
As shown in previous chapters, EM2 can improve performance and reduce on-chip traffic
by turning sequences of memory accesses to the same remote cache into migrations
followed by local cache accesses. To detect sequences suitable for migration, each EM2
core implements a learning migration predictor: a program counter (PC)-indexed,
direct-mapped data structure shown in Figure 5-6. In addition to detecting migration-friendly memory references and making a remote-access vs. migration decision for every
non-local load and store, our predictor reduces on-chip network traffic by learning and
deciding how much of the stack should be transferred for every migrating instruction.

Figure 5-6: Integration of a PC-based migration predictor into a stack-based, two-stage
pipelined core of EM2 (the looked-up stack transfer sizes are only used when migrating
from the native core)
The predictor bases these decisions on the instruction's PC. In most programs,
sequences of consecutive memory accesses to the same home core and context usage
patterns within those sequences are highly correlated with the instructions being
executed, and those patterns are fairly consistent and repetitive across program
execution. Each predictor has 32 entries, each of which consists of a tag for the PC
and the transfer sizes for the main and auxiliary stacks.
Detecting contiguous access sequences.
While the detection mechanism is
mostly similar to the one described in Chapter 3.2.2, we describe it here as well
in order to provide a self-contained view of the migration predictor design in the EM2
chip. Initially, the predictor table is empty, and all instructions are predicted to be
remote-access. To detect memory access sequences suitable for migration, the predictor
tracks how many consecutive accesses to the same remote core have been made, and,
if this count exceeds a (configurable) threshold θ, inserts the PC of the instruction
at the start of the sequence into the predictor. To accomplish this, each thread
tracks (1) home, which maintains the home location (core ID) for the memory address
being requested, (2) depth, which indicates how many times thus far a thread has
contiguously accessed the recent home location (i.e., the home field), and (3) start PC,
which tracks the PC of the first instruction that accessed memory at the home core.
As shown in Figure 5-6, these data structures within the migration predictor interface
with the execute stage of the core.
When a thread T executes a memory instruction for address A whose PC is P, it
must

1. find the home core H for A (e.g., by masking the appropriate bits);

2. if home = H (i.e., a memory access to the same home core as that of the previous
   memory access),

   (a) if depth < θ, increment depth by one;
   (b) otherwise, if depth = θ, insert start PC into the predictor table;

3. if home ≠ H (i.e., a new sequence starts with a new home core),

   (a) if depth < θ, invalidate any existing entry for start PC in the predictor
       table (thus making start PC non-migratory);
   (b) reset the current sequence counter (i.e., home ← H, start PC ← P,
       depth ← 1).

When an instruction is first inserted into the predictor, the stack transfer sizes for the
main and auxiliary stack are set to the default values of 8 (half of the main stack)
and 0, respectively.
Figure 5-7: Decision/learning mechanism of the migration predictor: (a) migrating
from a native core, (b) migrating from a guest core, (c) learning the best context size,
(d) learning from misprediction.
5.3.2
Partial Context Migration Policy
Migration prediction for memory accesses.
The predictor uses the instruction's
address (i.e., the PC) to look up the table of migrating sequences. When a load or
store instruction attempts to access an address that cannot be cached at the core
where the thread is currently running (a core miss) at the execute stage, the result
of the predictor lookup (at the fetch stage) is used: if the PC is in the table, the
predictor instructs the thread to migrate; otherwise, to perform a remote access.
When the predictor instructs a thread to migrate from its native core to another
core, it also provides the number of main and auxiliary stack entries that should be
migrated (cf. Figure 5-7a). Because the stacks in the guest context are not backed by
memory, however, all valid stack entries must be transferred (cf. Figure 5-7b).
Feedback and learning.
To learn how many stack entries to send when migrating
from a native context at runtime, the native context keeps track of the start PC that
caused the last migration. When the thread arrives back at its native core, it reports
the reason for its return: when the thread migrated back because of stack overflow
(or underflow), the stack transfer size of the corresponding start PC is decremented
(or incremented) accordingly (cf. Figure 5-7c). In this case, less (or more) of the
stack will be brought along the next time around, eventually reducing the number of
unnecessary migrations due to stack overflow and underflow.
The returning thread also reports the number of local memory instructions it
executed at the core it originally migrated to. If the thread returns without having
made θ accesses, the corresponding start PC is removed from the predictor table
and the access sequence reverts to remote access (cf. Figure 5-7d).³ This allows the
predictor to respond to runtime changes in program behavior.
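In illustrative C++ terms (the real logic is RTL inside the migration predictor; the 2-entry adjustment step is an assumption, and only the main stack is shown), the feedback path amounts to:

#include <cstdint>
#include <unordered_map>

// Sketch of the predictor's feedback/learning step when a thread returns to its native core.
enum ReturnReason { ST_UNDERFLOW, ST_OVERFLOW, EVICTION, CORE_MISS };

struct Entry { int main_xfer = 8; int aux_xfer = 0; };        // default transfer sizes

struct Em2Predictor {
    std::unordered_map<uint32_t, Entry> table;                // migratory start PC -> sizes
    int theta = 3;                                            // run-length threshold θ

    void on_return_to_native(uint32_t start_pc, ReturnReason why, int run_length) {
        auto it = table.find(start_pc);
        if (it == table.end()) return;
        if (why == ST_UNDERFLOW && it->second.main_xfer < 14)
            it->second.main_xfer += 2;                        // bring more of the stack next time
        else if (why == ST_OVERFLOW && it->second.main_xfer > 2)
            it->second.main_xfer -= 2;                        // bring less of the stack next time
        else if (why != EVICTION && run_length < theta)
            table.erase(it);                                  // too few accesses: revert to remote access
    }
};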
5.3.3
Implementation Details
Guided by the goal of simplicity and verification efficiency, we chose to implement one
per-core migration predictor, shared between the two contexts in each core (native and
guest), rather than dual per-core predictors (one for the native context and one for the
guest), or per-thread predictors whose state is transferred as a part of the migration
context. The per-thread predictor scheme was easy to reject because it would have
significantly increased the migrated context size and therefore violated our goal of
the most efficient thread migration mechanism. The dual predictor solution, on the
other hand, could in theory improve predictions because the two threads running on a
core would not "pollute" each other's predictor tables, at the cost of additional area
and verification time. Instead, we chose to preserve simplicity and implement a single
per-core predictor shared between the native and guest contexts, sizing the predictor
tables so that our tests showed no noticeable performance degradation (32 entries).
Table 5.1 shows the interface of the migration predictor in EM 2 .
³Returns caused by evictions from the remote core do not trigger removal, since the thread might
have completed θ accesses had it not been evicted.
Port Name            Direction   Description
CLK                  IN          Clock signal
RSTN                 IN          Reset signal

Fetch stage
lookup-en            IN          High when looking up the predictor to see whether the
                                 instruction at lookup-pc needs to migrate or not.
lookup-pc[31:0]      IN          PC at the Fetch stage for the predictor lookup.
is-em                OUT         0 for remote access; 1 for thread migration.
st1_xfer-size[2:0]   OUT         Migration context size for the primary stack; can vary
                                 among 2, 4, 6, 8, 10, 12 and 14 entries.
st2_xfer-size        OUT         Migration context size for the auxiliary stack; 0 for
                                 none, 1 for 2 entries.

Execute stage
tracker-en           IN          High for memory instructions to keep track of run
                                 lengths.
tracker-reset        IN          Clears the selected tracker (tracker-sel). Must clear
                                 the tracker before a thread migrates out.
tracker-sel          IN          0/1: selects the corresponding tracker for the specific
                                 hardware context (native/guest).
exec-pc[31:0]        IN          PC at the Execute stage. Updates the tracker (keeps
                                 track of home core and run length).
home-core[6:0]       IN          Home core ID for exec-pc.
threshold[4:0]       IN          Threshold of run length for a PC to be considered as
                                 EM. Assumed to be a constant value throughout the
                                 execution.
check-native-pc      IN          High only when a thread migrates back to its native
                                 core and updates the predictor according to mig-type:
                                 either deletes the PC or increments/decrements the
                                 transfer context size. exec-pc holds the PC that made
                                 the thread migrate out, which is used for this check.
mig-type[5:0]        IN          Specifies the cause of migration to the native core.
                                 There are six possible causes, each corresponding to
                                 one bit: ST1 underflow, ST1 overflow, ST2 underflow,
                                 ST2 overflow, eviction and core miss.
run-length[4:0]      IN          The number of memory instructions performed while a
                                 thread was away from its native core.

Table 5.1: Interface ports of the migration predictor in EM2
5.4 Physical Design of the EM2 Processor
5.4.1 Overview
Our primary purpose in building the proof-of-concept EM2 chip is to demonstrate the
benefit of fine-grained partial context migration for a directoryless architecture and
how the technique scales with a large number of cores. As such, our major design
goal was to implement our proposed scheme with more than 100 cores. Having a
large number of cores is also important since it directly relates to the performance
of hardware-level thread migration. This requirement imposed tight constraints in
terms of area and power consumption for each tile, because our total die area was
10 mmx 10 mm and the power budget for the entire chip was limited to around 12W
from the number of power pins we have for the chip. Therefore, throughout the
physical design process, we focused on reducing area and power consumption, rather
than making the processor able to run at high clock speed.
We also made several decisions to simplify designs that are not at the heart of the
proposed architecture we evaluate (e.g., memory interface, routers, etc.), and along
with the verification scalability of our design (more details described in Chapter 5.6.4),
the entire EM 2 chip design and implementation took only 18 man-months.
In terms of CAD tools, we used Synopsys Design Compiler to synthesize the RTL
code, and Cadence Encounter was used for placement and routing (P&R). The sign-off
timing closure was done with Synopsys PrimeTime static timing analysis (STA).
5.4.2
Tile-level
Figure 5-8 shows the layout of a single EM2 tile. The tile was synthesized with a
clock period of 3 ns (i.e., targeting a clock frequency of 333 MHz), and the dimensions
of the tile are 855 µm x 917 µm, resulting in an area of 0.784 mm². As shown in
Figure 5-8, SRAM blocks used for the instruction and data caches take almost half of
the tile area. For the modules other than SRAMs, we allowed ungrouping
during synthesis to maximize area efficiency; as a result, we can observe that the
routers are placed along the border of the tile to reduce latency between neighboring
tiles. The migration predictor accounts for about 2.6% of the EM2 tile area.

Figure 5-8: EM2 tile layout (core, routers, SRAM blocks, and the migration predictor)
To reduce power consumption, we first decided to use the high-voltage threshold
(HVT) standard cell library instead of the regular-voltage threshold (RVT) cells; this
reduced the leakage power of the EM 2 tile by 38%, and although the HVT cells have
slower switching speed, this was not an issue since our target frequency was not high.
We also used the automatic clock gating provided by Synopsys Design Compiler, which
reduced the dynamic power of the EM 2 tile by 67%. The power reduction by each
step is shown in Table 5.2.
                         RVT        HVT        HVT and Clock-gating
Internal Power (mW)      40.8611    40.1873    13.3454
Switching Power (mW)     1.3481     1.3579     2.0070
Leakage Power (mW)       32.569     20.465     18.793
Total Power (mW)         74.7783    62.0103    34.1448

Table 5.2: Power estimates of the EM2 tile (reported by Design Compiler)
5.4.3
Chip-level
Since an effective evaluation of the potential of our migration architecture directed us
towards as large a core count as feasible in our 10 mm x 10 mm of silicon, our final
taped-out EM2 chip includes 110 tiles, laid out in a 2D grid. For design simplicity and
verification efficiency, EM 2 implements a homogeneous tiled architecture; out of 110
tiles in the EM 2 ASIC, 108 are identical, while the remaining two include interfaces
to off-chip memory (cf. Figure 5-1). With this hierarchical design, our bottom-up
approach allowed us to simply replicate the layout of a single tile for the chip-level
design.
The sign-off timing closure was done at a clock frequency of 200 MHz. While
resolving setup time violations was not a big issue for EM2, removing hold time
violations was not trivial; hold time violations are actually more critical for a chip to function correctly
since they cannot be fixed after fabrication. While CAD tools (e.g., Encounter) solve
hold time violations commonly by inserting delay cells along the data paths, in
our design, hold time violations for the data paths between the neighboring routers
were not easily removed in this manner because the space between the two tiles
was too small to accommodate enough number of delay cells. Therefore, we instead
inserted a negative-edge flip-flop for each of these particular paths of the router, which
automatically gives an extra delay of half a clock cycle to the data path, solving hold
time violations.
The EM 2 chip die photo is shown in Figure 5-9.
Figure 5-9: Die photo of the 110-core EM2 chip
5.5 Evaluation Methods

5.5.1 RTL simulation
To evaluate the EM² implementation, we chose an idealized cache-coherent baseline architecture with a two-level cache hierarchy (a private L1 data cache and a shared L2 cache). In this scheme, the L2 is distributed evenly among the 110 tiles and the size of each L2 slice is 512 KB. An L1 miss results in a cache line being fetched from the L2 slice that corresponds to the requested address (which may be on the same tile as the L1 cache or on a different tile). While this cache fetch request must still traverse the network to the correct L2 slice and bring the cache line back, our cache-coherent
baseline is idealized in the sense that rather than focusing on the details of a specific
coherence protocol implementation, it does not include a directory and never generates
any coherence traffic (such as invalidates and acknowledgements); coherence among
caches is ensured "magically" by the simulation infrastructure. While such an idealized
scheme is impossible to implement in hardware, it represents an upper bound
on the performance of any implementable directory coherence protocol, and serves as
the ultimate baseline for performance comparisons.
To obtain the on-chip traffic levels and completion times for our architecture,
we began with the post-tapeout RTL of the EM2 chip, removed such ASIC-specific
features as scan chains and modules used to collect various statistics at runtime, and
added the same shared-L2 cache hierarchy as the cache-coherent baseline. Since our
focus is on comparing on-chip performance, the working set for our benchmarks is
sized to fit in the entire shared-L2 aggregate capacity. All of the simulations used
the entire 110-core chip RTL; for each benchmark, we report the completion times as
well as the total amount of on-chip network traffic (i.e., the number of times any flit
traveled across any router crossbar).
The ideal CC simulations only run one thread in each core, and therefore only
use the native context. Although the EM2 simulations can use the storage space of
both contexts in a given core, this does not increase the parallelism available to EM 2 :
because the two contexts share the same I$ port, only one context can be executing
an instruction at any given time.
Both simulations use the same 8 KB L1 instruction cache as the EM² chip. Unlike the PC, instruction cache entries are not migrated as part of the thread context; while this might at first appear to be a disadvantage when a thread first migrates to a new core, we have observed that in practice, at steady state, the I$ has usually already been filled (either by other threads or by previous iterations that execute the same instruction sequence), and the I$ hit rate remains high.
5.5.2 Area and power estimates
Area and power estimates were obtained by synthesizing RTL using Synopsys Design
Compiler (DC). For the EM² version, we used the post-tapeout RTL with the scan chains and statistics modules deleted; we reused the same IBM 45nm SOI process with the ARM sc12 low-power ASIC cell library and SRAM blocks generated by the IBM Memory Compiler. Synthesis targeted a clock frequency of 800 MHz and leveraged DC's automatic clock-gating feature.
To give an idea of how these costs compare against that of a well-understood,
realistic architecture, we also estimated the area and leakage power of an equivalent
design where the data caches are kept coherent via a directory-based MESI protocol
(CC). We chose an exact sharer representation (one bit for each of the 110 sharers) and
either the same number of entries as in the data cache (CC 100%) or half the entries
(CC 50%); in both versions the directory was 4-way set-associative. To estimate the
area and leakage power of the directory, we synthesized a 4-way version of the data
cache controller from EM 2 chip with SRAMs sized for each directory configuration,
using the same synthesis constraints (since a directory controller is somewhat more
complex than a cache controller, this approach likely results in a slight underestimate).
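For intuition about the storage being added in the CC configurations, the back-of-the-envelope C calculation below estimates the per-tile directory SRAM; the line size and the tag/state width are illustrative assumptions, not parameters taken from the actual design.

#include <stdio.h>

int main(void)
{
    const int sharer_bits  = 110;        /* one exact-sharer bit per core       */
    const int dcache_bytes = 32 * 1024;  /* L1 D$ capacity                      */
    const int line_bytes   = 64;         /* assumed cache line size             */
    const int tag_state    = 32;         /* assumed tag + state bits per entry  */

    int dcache_entries = dcache_bytes / line_bytes;
    for (int pct = 100; pct >= 50; pct -= 50) {
        int  entries = dcache_entries * pct / 100;
        long bits    = (long)entries * (sharer_bits + tag_state);
        printf("CC %3d%%: %4d entries, ~%ld Kbit of directory SRAM per tile\n",
               pct, entries, bits / 1024);
    }
    return 0;
}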
For area and leakage power, we report the synthesis estimates computed by DC.
While all of these quantities typically change somewhat post-layout (because of factors
like routing congestion or buffers inserted to avoid hold-time violations), we believe
that synthesis results are sufficient to make architectural comparisons.
Dynamic power dominates the power signature, but is highly dependent on the
specific benchmark, and obtaining accurate estimates for all of our benchmarks is not
practical. Instead, we observe that for the purposes of comparing EM2 to the baseline
architecture, it suffices to focus on the differences, which consist of (a) the additional
core context, (b) the migration predictor, and (c) differences in cache and network
accesses. The first two are insignificant: our implementation allowed only one of the
EM² core contexts to be active in any given cycle, so even though the extra context adds leakage, dynamic power remains constant. The migration predictor is a small
part of the tile and does not add much dynamic power according to our analysis. Since
we ran the same programs and pre-initialized caches, the cache accesses were the same,
meaning equal contribution to dynamic power. The only significant difference is in
the dynamic network power, which is directly proportional to the on-chip network
traffic (i.e., the number of network flits sent times the distance traveled by each flit);
we therefore report this for all benchmarks as a proxy for dynamic power.
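Stated precisely (in our own notation, not the chip's instrumentation), the reported traffic metric for a benchmark is

\[ \text{traffic} \;=\; \sum_{p \,\in\, \text{packets}} \text{flits}(p) \times \text{hops}(p), \]

and, under the assumption that crossbar and link energy per flit-hop is roughly constant, dynamic network energy is proportional to this quantity.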
5.6 Evaluation

5.6.1 Performance tradeoff factors
To precisely understand the conditions under which fast thread migration results in
improved performance, we created a simple parameterized benchmark that executes a
sequence of loads to memory assigned to a remote L2 slice. There are two parameters:
the run length is the number of contiguous accesses made to the given address range,
and cache misses is the number of L1 misses these accesses induce (in other words,
this determines the stride of the access sequence); we also varied the on-chip distance
between the tile where the thread originates and the tile whose L2 caches the requested
addresses.
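A minimal C sketch of this microbenchmark is shown below. The buffer placement, the 64-byte line size, and all names are illustrative assumptions (the actual EM² test code differs), but the sketch captures how run length and cache misses can be varied independently.

#include <stdint.h>

#define LINE_WORDS 8   /* assumed 64-byte cache line of 8-byte words */

volatile int64_t sink;

/* Perform `run_length` contiguous loads from a buffer placed on a remote
 * tile's L2 slice, spread over `cache_misses` distinct cache lines (the
 * stride between groups induces the L1 misses). */
void remote_run(const int64_t *remote_buf, int run_length, int cache_misses)
{
    int per_line = run_length / cache_misses;   /* accesses per touched line */

    for (int m = 0; m < cache_misses; m++)
        for (int i = 0; i < per_line; i++)
            sink += remote_buf[m * LINE_WORDS + i];
}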
Figure 5-10: Thread migration (EM²) vs. remote access (RA): completion time (cycles) and network traffic (flit×hop) as a function of run length, for RA-only and for EM² with migrated contexts of 4, 8, and 12 stack entries (EM²-4, EM²-8, EM²-12).
Figure 5-10 shows how a program that only makes remote cache accesses (RA-only)
compares with a program that migrates to the destination core 4 hops away, makes
the memory accesses, and returns to the core where it originated (EM 2 ), where the
migrated context size is 4, 8, and 12 stack entries (EM2 -4, EM2 -8, and EM 2 -12).
Since the same L1 cache is always accessed, whether locally or remotely, both versions result in exactly the same L1 cache misses, and the only relevant parameter is the run
length. For a singleton access (run length = 1), RA is slightly faster than any of the
migration variants because the two migration packets involved are longer than the RA
request/response pair, and, for the same reason, induce much more network traffic.
For multiple accesses, however, the single migration round-trip followed by local cache
accesses performs better than the multiple remote cache access round trips, and the
advantage of the migration-based solution grows as the run length increases.
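A rough first-order model of this tradeoff (our own notation; it ignores flit serialization, network contention, and pipeline refill) is

\[ \underbrace{\ell\,\bigl(T_{RA}(h) + t_{\$}\bigr)}_{\text{RA-only}} \;\ge\; \underbrace{2\,T_{mig}(h) + \ell\,t_{\$}}_{\text{EM}^2} \quad\Longleftrightarrow\quad \ell \;\ge\; \frac{2\,T_{mig}(h)}{T_{RA}(h)}, \]

where \(\ell\) is the run length, \(h\) the hop distance, \(T_{RA}(h)\) the latency of one remote-access round trip, \(T_{mig}(h)\) the latency of one (longer, multi-flit) migration over the same distance, and \(t_{\$}\) a local cache access. Because a migration round trip costs only modestly more than a remote-access round trip, the break-even run length is small, consistent with Figure 5-10.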
Figure 5-11: Thread migration (EM²) vs. private caching (CC): completion time (cycles) and network traffic (flit×hop) as a function of the number of cache misses, for CC-ideal and for EM² with contexts of 4, 8, and 12 stack entries.
The tradeoff against our "ideal cache coherence" private-cache baseline (CC) is
less straightforward than against RA: while CC will still make a separate request to
load every cache line, subsequent accesses to the same cache line will result in L1
cache hits and no network traffic. Figure 5-11 illustrates how the performance of CC
and EM 2 depends on how many times the same cache line is reused in 8 accesses.
When all 8 accesses are to the same cache line (cache misses = 1), CC requires one
round-trip to fetch the entire cache line, and is slightly faster than EM2 , which needs
to unload the thread context, transfer it, and load it in the destination core. Once the
number of misses grows, however, the multiple round-trips required in CC become
more costly than the context load/unload penalty of the one round-trip migration,
and EM 2 performs better. And in all cases, EM 2 can induce less on-chip network
traffic: even in the one-miss case where CC is faster, the thread context that EM 2 has
to migrate is often smaller than the CC request and the cache line that is fetched.
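The corresponding first-order comparison against private caching (again in our own illustrative notation) is

\[ \underbrace{m\,T_{fetch}(h) + \ell\,t_{\$}}_{\text{CC-ideal}} \;\gtrsim\; \underbrace{2\,T_{mig}(h) + \ell\,t_{\$}}_{\text{EM}^2} \quad\Longleftrightarrow\quad m \;\gtrsim\; \frac{2\,T_{mig}(h)}{T_{fetch}(h)}, \]

where \(m\) is the number of L1 misses among the \(\ell\) accesses and \(T_{fetch}(h)\) the latency of one cache-line fetch round trip; the break-even number of misses is small, consistent with Figure 5-11.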
Figure 5-12: The effect of distance on RA, CC, and EM²: completion time (cycles) vs. number of hops (4, 8, 12) for RA-only, CC-ideal, and EM²-8.
Finally, Figure 5-12 examines how the three schemes are affected by the on-chip
distance between the core where the thread originates and the core that caches the
requested data (with run length = 8 and cache misses = 2). RA, which requires a
round-trip access for every word, grows the fastest (i.e., eight round-trips), while CC,
which only needs a round-trip cache line fetch for every L1 miss (i.e., two round-trips),
grows much more slowly. Because EM 2 only requires one round-trip for all accesses,
the distance traveled is not a significant factor in performance.
5.6.2 Benchmark performance
Figure 5-13 shows how the performance of EM2 compares to the ideal CC baseline
for several benchmarks. These include: (1) single-threaded memcpy in next-neighbor
(near) and cross-chip (far) variants, (2) parallel k-fold cross-validation (par-cv), a
machine learning technique that uses stochastic gradient learning to improve model
accuracy, (3) 2D Jacobi iteration (jacobi), a widely used algorithm to solve partial
differential equations, and (4) partial table scan (tbscan), which executes queries that
scan through a part of a globally shared data table distributed among the cache shards.
We first note some overall trends and then discuss each benchmark in detail below.
Figure 5-13: The evaluation of EM²: (a) completion time and (b) network traffic, both normalized to CC-ideal, for CC-ideal, RA-only, and EM² on memcpy-near, memcpy-far, par-cv, jacobi, tbscan-16, and tbscan-110.
Overall remarks. First, Figure 5-13 illustrates the overall performance (i.e., completion time) and on-chip network traffic of the ideal directory-based baseline (CC), the remote-access-only variant (RA), and the EM² architecture. Overall, EM² always outperforms RA, offering up to a 3.9x reduction in run time, and performs as well as or better than CC in all cases except one. Throughout, EM² also offers significant reductions in on-chip network traffic, up to 42x less traffic than CC for par-cv.
Migration rates, shown in Figure 5-14a, range from 0.2 to 20.9 migrations per
1,000 instructions depending on the benchmark. These quantities justify our focus
on efficient thread movement: if migrations occur at the rate of nearly one in every hundred to thousand instructions, taking 1,000+ cycles to move a thread to a different core would indeed incur a prohibitive performance impact. Most migrations are caused by data accesses, with stack under/overflow migrations at a negligible level, and evictions significant only in the tbscan benchmarks.

Figure 5-14: Thread migration statistics under EM²: (a) the number of migrations per thousand instructions, broken down into data-access, eviction, and stack over/underflow migrations; (b) average migration latency (cycles) and average migration size (% of full context).
Even with many threads, effective migration latencies are low (Figure 5-14b, bars),
with the effect of distance clearly seen for the near and far variants of memcpy; the
only exception here is par-cv, in which the migration latency is a direct consequence
of delays due to inter-thread synchronization (as we explain below). At the same
time, migration sizes (Figure 5-14b, line) vary significantly, and stay well below the
60% mark (44% on average): since most of the on-chip traffic in the EM2 case is due
to migrations, forgoing partial-context migration support would have significantly
increased the on-chip traffic (cf. Figure 5-13b).
Memory copy. The memcpy-near and memcpy-far benchmarks copy 32 KB (the size of an L1 data cache) from a memory address range allocated to a next-neighbor tile (memcpy-near) or a tile at the maximum distance across the 110-core chip (memcpy-far). In both cases, EM² is able to repeatedly migrate to the source tile, load up a full
thread context's worth of data, and migrate back to store the data at the destination
addresses; because the maximum context size exceeds the cache line size that ideal CC
fetches, EM 2 has to make fewer trips and performs better both in terms of completion
time and network traffic. Distance is a significant factor in performance (the fewer round-trips of EM² make a bigger difference when the source and destination cores are far apart) but does not change the percentage improvement in network traffic, since that is determined by the total amount of data transferred in EM² and CC.
Partial table scan.
In this benchmark, random SQL-like queries are assigned to
separate threads, and the table that is searched is distributed in equal chunks among
the per-tile L2 caches. We show two variants: a light-load version where only 16
threads are active at a time (tbscan-16) and a full-load version where all of the 110
available threads execute concurrently (tbscan-110); under light load, EM 2 finishes
slightly faster than CC-ideal and significantly reduces network traffic (2.9x), while
under full load EM 2 is 1.8x slower than CC-ideal and has the same level of network
traffic.
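The per-thread behavior can be pictured with the C sketch below; the row type, names, and query predicate are illustrative assumptions rather than the actual benchmark code. Each query scans one randomly chosen chunk, so a thread issues a long run of accesses to whichever tile owns that chunk.

#include <stdint.h>
#include <stdlib.h>

typedef struct { int32_t key; int32_t value; } row_t;

extern row_t *chunk[110];        /* chunk[i] is mapped to tile i's L2 slice */
extern int    rows_per_chunk;

/* One tbscan worker thread: each query scans a single random chunk. */
long run_queries(int nqueries, unsigned seed)
{
    long matches = 0;
    for (int q = 0; q < nqueries; q++) {
        int     c      = rand_r(&seed) % 110;      /* random chunk -> random tile */
        int32_t needle = (int32_t)rand_r(&seed);
        for (int r = 0; r < rows_per_chunk; r++)   /* long run of accesses to tile c */
            if (chunk[c][r].key == needle)
                matches += chunk[c][r].value;
    }
    return matches;
}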
Why such a large difference?
Under light load, EM2 takes full advantage of
data locality, which allows it to significantly reduce on-chip network traffic, but
performs only slightly better than CC-ideal because queries that access the same
data chunks compete for access to the same core and effectively serialize some of the
computation. Because the queries are random, this effect grows as the total number of
threads increases (Figure 5-15), resulting in very high thread eviction rates under full
load (Figure 5-14a); this introduces additional delays and network traffic as threads
ping-pong between their home core and the core that caches the data they need.
Figure 5-15: Completion time and network traffic for tbscan under EM² with different numbers of threads (1 to 110), including the EM²-N10 and EM²-N100 variants with guaranteed guest-context occupancy.
Figure 5-16: Under EM², a thread that migrates into a guest context is allowed to execute N instructions before it may be evicted.
This ping-pong effect, and the associated on-chip traffic, can be reduced by
guaranteeing that each thread can perform N (configurable in hardware) instructions
before being evicted from a guest context, as illustrated in Figure 5-16. Figure 5-15
illustrates how tbscan performs when N = 10 and N = 100: a longer guaranteed
guest-context occupation time results in up to 2x reductions in network traffic at the
cost of a small penalty in completion time due to the increased level of serialization.
This highlights an effective tradeoff between performance and power: with more
serialization, EM 2 can use far less dynamic power due to on-chip network traffic
(and because fewer cores are actively computing) if the application can tolerate lower
performance.
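A simple software model of this guarantee is sketched below; the field and function names are made up, and in the chip the check is a small hardware counter per guest context. Raising N trades extra serialization (arriving threads wait longer) for fewer evictions and less ping-pong traffic.

/* Sketch of the guest-context eviction guarantee (illustrative only). */
typedef struct {
    int occupied;            /* is a guest thread currently resident?      */
    int insns_since_entry;   /* instructions retired since it migrated in  */
} guest_ctx_t;

/* A guest may be evicted only after it has executed at least N instructions. */
int may_evict(const guest_ctx_t *g, int N)
{
    return g->occupied && g->insns_since_entry >= N;
}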
Parallel k-fold cross-validation. As previously described in Chapter 3.3.1, parallel k-fold cross-validation runs k independent leave-one-out experiments, where each experiment requires the entire set of data samples. Since the model used in each experiment is necessarily sequential for sequential machine learning algorithms, each experiment naturally maps to a thread; this is the natural form of parallelization. The data samples are split into k data chunks, which are typically spread across the shared caches; since each experiment repeatedly accesses a given chunk before moving on to the next chunk, it has a fairly high run length, which favors EM².
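The access pattern that favors EM² is visible in the following C sketch of one par-cv worker thread; the names and the learner are illustrative assumptions, not the actual benchmark code. Each pass over a chunk is a long, uninterrupted run of accesses to the tile that caches that chunk, so a single migration is amortized over many local accesses.

extern double *chunk_data[110];      /* chunk i is mapped to tile i's cache */
extern int     samples_per_chunk;
extern void    sgd_update(double *model, const double *sample);

/* Worker thread `tid`: leave chunk `tid` out, learn from the other chunks
 * in order; the model itself is thread-local. */
void cross_validate(int tid, int k, double *model)
{
    for (int c = 0; c < k; c++) {
        if (c == tid)
            continue;                             /* leave-one-out */
        for (int s = 0; s < samples_per_chunk; s++)
            sgd_update(model, &chunk_data[c][s]); /* local once the thread sits on tile c */
    }
}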
With overall completion time slightly better under EM 2 than under CC-ideal and
much better than under RA-only, par-cv stands out for its 42x reduction in on-chip
network traffic vs. CC-ideal (96x vs. RA). This is because the cost of every migration
is amortized by a large amount of local cache accesses on the destination core (as
the algorithm learns from the given data chunk), while CC-ideal continuously fetches
more data to feed the computation.
Completion time for par-cv, however, is only slightly better because of the nearly
200-cycle average migration times at full 110-thread utilization (Figure 5-14b). This
is because of a serialization effect similar to that in tbscan: a thread that has finished
learning on a given chunk and migrates to proceed onto the next chunk must sometimes
wait en route while the previous thread finishes processing that chunk. Unlike tbscan,
however, where the contention results from random queries, the threads in par-cv
process the chunks in order and avoid the penalties of eviction. As a result, at the same full utilization of 110 threads, par-cv completes faster under EM² whereas tbscan completes faster under CC. (At a lower utilization, the average
migration latency of par-cv falls: e.g., at 50 threads it becomes 9 cycles, making the
EM 2 version 11% faster than CC.)
2D Jacobi iteration. In its essence, the jacobi benchmark propagates a computation through a matrix, and so the communication it incurs is between the boundary
of the 2D matrix region stored in the current core and its immediate neighbors stored
in the adjacent cores. Since the data accesses are largely to a thread's own private
region, intercore data transfers are a negligible factor in the overall completion time,
and the runtime for all three architectures is approximately the same.
In the naive form, the local elements are computed one by one, and all of the
memory accesses to remote cores become one-off accesses; in this case, the predictor
never instructs threads to migrate and EM 2 will behave the same as the RA-only
baseline. By using loop unrolling, however, the performance of EM2 can be improved:
multiple remote loads are now performed contiguously, meaning that a thread migrates
with a few addresses for loads, and migrates back with its stack filled with multiple load
results (see Figure 5-17). In this manner, EM 2 is able to reduce the overall network
traffic because it can amortize the costs of migrating by consecutively accessing many
matrix elements in the boundary region, while CC-ideal has to access this data with
several L2 fetches. While unrolling does not change the performance under the RA
regime, it allows EM 2 to incur 31% less network traffic than RA.
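The effect of unrolling can be illustrated with the simplified C stencil below (names and the exact update are illustrative, not the benchmark source). In the naive loop, each iteration performs a single one-off load from the neighbor-owned row, so the predictor never migrates; the unrolled loop groups four such loads so that one round-trip migration can service all of them.

/* Update the boundary row of a thread's block; north_row is owned by the
 * neighboring core, everything else is local. */
void update_boundary(double *nxt, const double *cur, const double *north_row,
                     const double *south_row, int n, int unrolled)
{
    if (!unrolled) {
        for (int j = 1; j < n - 1; j++)          /* one remote load per iteration */
            nxt[j] = 0.25 * (north_row[j] + south_row[j] + cur[j-1] + cur[j+1]);
    } else {
        for (int j = 1; j + 4 < n; j += 4) {     /* four contiguous remote loads */
            double n0 = north_row[j],   n1 = north_row[j+1];
            double n2 = north_row[j+2], n3 = north_row[j+3];
            nxt[j]   = 0.25 * (n0 + south_row[j]   + cur[j-1] + cur[j+1]);
            nxt[j+1] = 0.25 * (n1 + south_row[j+1] + cur[j]   + cur[j+2]);
            nxt[j+2] = 0.25 * (n2 + south_row[j+2] + cur[j+1] + cur[j+3]);
            nxt[j+3] = 0.25 * (n3 + south_row[j+3] + cur[j+2] + cur[j+4]);
        }
        /* remainder iterations omitted for brevity */
    }
}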
Figure 5-17: EM² allows efficient bulk loads from a remote core, in contrast to RA, which requires a separate round trip for each load.
5.6.3 Area and power costs
Since the CC-ideal baseline we use for the performance evaluation above does not have directories, it does not make a good baseline for area and power comparison. Instead, we estimated the area required for MESI implementations with the directory sized to 100% and 50% of the total L1 data cache entries, and compared the area and leakage power to that of EM². The L2 cache hierarchy, which was added for a more realistic performance evaluation and is not part of the actual chip, is not included here for either EM² or CC.
Table 5.3 summarizes the architectural components that differ. EM 2 requires an
extra architectural context (for the guest thread) and on-chip networks for migrations
and evictions as well as RA requests and responses. Our EM 2 implementation also
includes a learning migration predictor; while this is not strictly necessary in a purely
instruction-based migration design, it offers runtime performance advantages similar to
those of a hardware branch predictor. In comparison, a deadlock-free implementation
of MESI would replace the four migration and remote-access on-chip networks with
three (for coherence requests, replies, and invalidations), implement D$ controller
logic required to support the coherence protocol, and add the directory controller and
associated SRAM storage.
                                              EM²    CC
extra execution context in the core           yes    no
migration predictor logic & storage           yes    no
remote cache access support in the D$         yes    no
coherence protocol logic in the D$            no     yes
coherence directory logic & storage           no     yes
number of independent on-chip networks        6      5

Table 5.3: A summary of architectural costs that differ in the EM² and CC implementations.

Figure 5-18 shows how the silicon area and leakage power compare. Not surprisingly, blocks with significant SRAM storage (the instruction and data caches, as well as the directory in the CC version) were responsible for most of the area in all variants. Overall, the extra thread context and extra router present in EM² were outweighed by the area required for the directory in both the 50% and 100% versions of MESI, which suggests that EM² may be an interesting option for area-limited CMPs.

Figure 5-18: Relative area and leakage power costs of EM² vs. estimates for exact-sharer CC with the directory sized to 100% and 50% of the D$ entries (DC Ultra, IBM 45nm SOI hvt library, 800 MHz).
5.6.4 Verification Complexity
With evolving VLSI technology and increasing design complexity, verification costs
have become more critical than ever. Increasing core counts are only making the
problem worse because any pairwise interactions among cores result in a combinatorial
explosion of the state space as the number of cores grows. Distributed cache coherence
protocols in particular are notoriously complex and difficult to design
and verify. The response to a given request is determined by the state of all actors
in the system (for example, when one cache requests write access to a cache line,
any cache containing that line must be sent an invalidate message); moreover, the
indirections involved and the nondeterminism inherent in the relative timing of events
requires a coherence protocol implementation to introduce many transient states that
are not explicit in the higher-level protocol. This causes the number of actual states in
even relatively simple protocols (e.g., MSI, MESI) to explode combinatorially [3], and results in complex cooperating state machines driving each cache and directory [39].

Figure 5-19: Bottom-up verification methodology of EM²: modules (core, cache, router, migration) are verified individually, then integrated into a single tile, a 4-tile system, and finally the 110-tile system; increasing the system size introduced no new bugs.
In fact, one of the main sources of bugs in such protocols is reachable transient states
that are missing from the protocol definition, and fixing them often requires non-trivial
modifications to the high-level specification. To make things worse, many transient
states make it difficult to write well-defined testbench suites: with multiple threads
running in parallel on multicores, writing high-level applications that exercise all the
reachable low-level transient states-or even enumerating those states-is not an easy
task. Indeed, descriptions of more optimized protocols can be so complex that they
take experts months to understand, and bugs can result from specification ambiguities
as well as implementation errors [35]. Significant modeling simplifications must be
made to make exploring the state space tractable [1], and even formally verifying a
given protocol on a few cores gives no confidence that it will work on 100.
While design and verification complexity is difficult to quantify and compare, both
the remote-access-only baseline and the full EM 2 system we implemented have a
significant advantage over directory cache coherence: a given memory address may
only be cached in a single place. This means that any request, remote or local, will
depend only on the validity of a given line in a single cache, and no indirections or
transient states are required. The VALID and DIRTY flags that together determine
the state of a given cache line are local to the tile and cannot be affected by state
changes in other cores. The thread migration framework does not introduce additional
complications, since the data cache does not care whether a local memory request
comes from a native thread or a migrated thread: the same local data cache access
interface is used. The overall correctness can therefore be cleanly separated into
(a) the remote access framework, (b) the thread migration framework, (c) the cache
that serves the memory request, and (d) the underlying on-chip interconnect, all of
which can be reasoned about separately. This modularity makes the EM2 protocols
easy to understand and reason about, and enabled us to safely implement and verify
modules in isolation and integrate them afterwards without triggering bugs at the
module or protocol levels (cf. Figure 5-19).
The homogeneous tiled architecture we chose for EM2 allowed us to significantly
reduce verification time by first integrating the individual tiles in a 4-tile system.
This resulted in far shorter simulation times than would have been possible with the
110 cores, and allowed us to run many more test programs. At the same time, the
4-tile arrangement exercised all of the inter-tile interfaces, and we found no additional
bugs when we switched to verifying the full 110-core system, as shown in Figure 5-19.
Unlike directory entries in directory-based coherence designs, EM2 cores never store
information about more than the local core, and all of the logic required for the migration framework (the decision whether to migrate or execute a remote cache access, the calculation of the destination core, serialization and deserialization of network packets from/to the execution context, evicting a running thread if necessary, etc.) is local to the tile. As a result, it was possible to exercise the entire state space
in the 4-tile system; perhaps more significantly, however, this also means that the
system could be scaled to an arbitrary number of cores without incurring an additional
verification burden.
5.7 Chapter Summary
In this chapter, we have presented the 110-core EM2 chip, a silicon implementation of
a directoryless architecture using thread migration and remote access. By employing
a stack-based architecture, EM 2 minimizes thread migration costs and elegantly
supports partial context migration; the taped-out chip also supports on-line learning
and prediction of when to migrate and what part of the context to send upon migration
by the migration predictor. Through RTL simulation, we demonstrate that EM 2 can
improve performance and reduce network traffic compared to the remote-access-only
design, and for some benchmarks, when compared to the cache-coherent baseline as
well.
Moreover, since EM² is built on top of a directoryless memory substrate, it provides shared memory without the need for a coherence protocol or directories, offsetting the
area overhead of the migration framework while reducing verification complexity at
the same time.
Chapter 6
Conclusions
6.1 Thesis contributions
For conventional manycore CMPs with private caches, the data must be brought to
the core where the thread is running whenever a thread needs data mapped on remote
shared cache slices. When a thread repeatedly accesses data at the remote caches,
however, this incurs large delays and significant network traffic. Furthermore, such
private caches need to maintain cache coherence to support shared memory, often
achieved by a complex coherence protocol and distributed directories.
In this thesis, we first proposed a directoryless architecture that uses thread
migration and remote access to access remotely mapped data. Since we do not allow
cache line replication across on-chip caches, coherence is trivially ensured without the
need for directories (and thus, we call it a directoryless architecture). At the same
time, we use our fine-grained thread migration mechanism to complement remote word
accesses in order to better exploit data locality under such an architecture. However, we
observed that high migration costs make it critical to use thread migrations judiciously.
Therefore, we have developed an on-line, PC-based migration predictor which decides
between a remote access or a thread migration at instruction granularity. Moreover,
we extend the migration predictor to support partial context thread migration by
learning and predicting the necessary thread context at runtime, which further reduces
migration costs.
To validate our proposed architecture, we further implemented a 110-core Execution
Migration Machine (EM 2 ) processor using a 45nm ASIC technology. This thesis
discusses the design and physical implementation details of our prototype chip which
adopts a stack-based core architecture, and also provides detailed evaluation results
using RTL-level simulation.
Our results show that our proposed architecture with the migration predictor can improve performance and significantly reduce network traffic compared to a remote-access-only architecture. We have also demonstrated that, for certain applications, our
proposed design can outperform or match directory-based coherence with less on-chip
traffic and reduced verification complexity. Given that the architecture requires no
directories or complicated coherence protocols and, unlike directory-based coherence
protocols, its verification scope does not grow with the number of cores, we believe
that our approach provides an interesting design point on the hardware coherence
spectrum for many-core CMPs.
6.2 Architectural assumptions and their implications
While the architecture proposed in this dissertation assumes in-order, single-issue
cores for the underlying hardware, modern processors often have more complex cores
for higher performance. Here, we discuss the requirements and limitations of our
proposed scheme on such complex cores, as well as the performance implications in
parallel workloads with heterogeneous threads.
Multiple outstanding memory accesses.
Under single-issue in-order cores, a
thread will not execute a memory instruction until its previous memory instruction
completes; a thread could, therefore, start migrating without extra waiting. On the
other hand, if multiple outstanding memory accesses are allowed (e.g., superscalar
out-of-order cores), a thread could have multiple remote accesses in flight at the time when it wishes to migrate under our directoryless architecture. In order
to provide functional correctness, therefore, the migration hardware needs to ensure
that all the responses are received, i.e., that there exist no outstanding remote accesses,
before a thread migration can actually happen. While this constraint is sufficient for
a correct execution, it can possibly affect the migration decision mechanism. Since
multiple remote accesses can now be interleaved hiding the latency, in order to make
the cost of thread migration worthwhile, we might need a longer run length than we did
in single-issue cores. In addition, it might be beneficial to relax the notion of run length from the number of consecutive accesses to a given core to the number of most frequent accesses to that core, because even the same sequence of memory instructions can execute in a different order at runtime. In terms of network traffic reduction, it is
important to note that interleaving multiple memory accesses does not help; migrating
a thread can still reduce overall on-chip traffic.
Deeply pipelined core.
While we assume a five-stage pipeline core in this thesis,
modern CPUs running at GHz frequencies often have deeper pipelines with more than
ten stages. While no architectural changes are required to specifically support deeply
pipelined cores for our design, the pipeline depth affects the cost of thread migration
since a thread needs to re-execute from the beginning of the pipeline after migrating
to another core (cf. Chapter 2.4). We have, however, observed that increasing the
pipeline depth has a negligible effect on the overall performance due to low migration
rates.
Workloads with heterogeneous threads.
A dominant class of multithreaded
programs typically runs parallel worker threads with almost identical instruction
streams (i.e., executes very similar instructions although on different data). While
our architecture is not restricted to any specific applications, such a high instruction
similarity between threads keeps the performance overhead due to extra I-cache misses
reasonably low. In addition, thread interference in the migration predictor is also
minimal for the same reason, and thus, not migrating the predictor content with a
thread and allowing a per-core predictor to be shared among threads are sufficiently
efficient in terms of performance.
There exists, however, another type of parallel workloads, where each thread
executes different instructions. For example, streaming applications can assign threads
to each pipeline stage to exploit pipeline parallelism; migrating a thread in such an
application would result in a higher performance overhead, because most of the necessary instructions would have to be refetched into the instruction cache at the core where the thread has migrated. Possible future solutions to this overhead include taking the I-cache miss penalty into account when deciding whether to migrate, and/or sending one or two instruction cache blocks along with the thread context, which could minimize the performance overhead (especially when combined with instruction prefetch hardware) at the cost of a slightly larger migration.
6.3 Future avenues of research
Since no automatic data replication is allowed under our proposed directoryless architecture, the hardware's ability to take advantage of available parallelism is restricted, which limits the performance benefits. We believe ways to avoid this limitation, such as implementing thread migration on top of simplified hardware coherence or software coherence, can be explored.
While this thesis focuses on using the migration infrastructure to accelerate remote
data access and reduce network traffic for a directoryless shared-memory architecture,
we view fine-grained partial-context thread migration as an enabling technology
suitable for many applications. Being fast and efficient, thread migrations can happen far more frequently than conventional migration schemes (e.g., OS/software-level migration) could support given their high costs. We believe investigating the possible applications of fine-grained migration can lead to further future research.
Bibliography
[1] Dennis Abts, Steve Scott, and David J. Lilja. So many states, so little time:
Verifying memory coherence in the cray x1. In PDP, 2003.
[2] Adapteva. Startup has big plans for tiny chip technology. In Wall Street Journal,
2011.
[3] Arvind, Nirav Dave, and Michael Katelman. Getting formal verification into
design flow. In FM2008, 2008.
[4] M. Awasthi, K. Sudan, R. Balasubramonian, and J. Carter. Dynamic hardwareassisted software-controlled page placement to manage capacity allocation and
sharing within large caches. In HPCA, 2009.
[5] Moshe (Maury) Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi
Devor, Kim Hazelwood, Aamer Jaleel, Chi-Keung Luk, Gail Lyons, Harish Patil,
and Ady Tal. Analyzing Parallel Programs with Pin. Computer, 43, 2010.
[6] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. In MICRO, 2004.
[7] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif,
Liewei Bao, J. Brown, M. Mattina, Chyi-Chang Miao, C. Ramey, D. Wentzlaff,
W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney,
and J. Zook. TILE64 - Processor: A 64-Core SoC with Mesh Interconnect. In
Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers.
IEEE International,Feb 2008.
[8] Shekhar Borkar. Thousand core chips: A technology perspective. In Proceedings
of the 44th Annual Design Automation Conference, DAC '07, pages 746-749, New
York, NY, USA, 2007. ACM.
[9] Silas Boyd-Wickizer, Robert Morris, and M. Frans Kaashoek. Reinventing scheduling for multicore systems. In HotOS, 2009.
[10] Jeffery A. Brown and Dean M. Tullsen. The shared-thread multiprocessor. In
ICS, 2008.
[11] Koushik Chakraborty, Philip M. Wells, and Gurindar S. Sohi. Computation
spreading: employing hardware migration to specialize CMP cores on-the-fly. In
ASPLOS, 2006.
[12] Jichuan Chang and Gurindar S. Sohi. Cooperative caching for chip multiprocessors.
In ISCA, 2006.
[13] M. Chaudhuri. PageNUCA: Selected policies for page-grain locality management
in large shared chip-multiprocessor caches. In HPCA, 2009.
[14] Myong Hyon Cho, Keun Sup Shim, Mieszko Lis, Omer Khan, and Srinivas
Devadas. Deadlock-free fine-grained thread migration. In NOCS, 2011.
[15] Sangyeun Cho and Lei Jin. Managing Distributed, Shared L2 Caches through
OS-Level Page Allocation. In MICRO, 2006.
[16] Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand,
Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, and Ching-Tsun Chou.
DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism. In PACT,
2011.
[17] Tilera Corporation. Tilera announces tile-gx72, the world's highest performance
and highest-efficiency manycore processor. In Tilera Press Release, Feb 2013.
[18] Blas A. Cuesta, Alberto Ros, María E. Gómez, Antonio Robles, and José F. Duato. Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, pages 93-104, New York, NY, USA, 2011.
[19] William J. Dally and Brian Towles. Principles and practices of interconnection
networks. Morgan Kaufmann, 2003.
[20] Socrates Demetriades and Sangyeun Cho. Stash directory: A scalable directory
for many-core coherence. In High Performance Computer Architecture (HPCA),
2014 IEEE 20th InternationalSymposium on, Feb 2014.
[21] R.H. Dennard, F.H. Gaensslen, V.L. Rideout, E. Bassous, and A.R. LeBlanc.
Design of ion-implanted mosfet's with very small physical dimensions. Solid-State
Circuits, IEEE Journal of, 9(5):256-268, Oct 1974.
[22] A. DeOrio, A. Bauserman, and V. Bertacco. Post-silicon verification for cache
coherence. In ICCD, 2008.
[23] C. Fensch and M. Cintra. An OS-based alternative to full hardware coherence on
tiled CMPs. In HPCA, 2008.
[24] M. Ferdman, P. Lotfi-Kamran, K. Balet, and B. Falsafi. Cuckoo directory:
A scalable directory for many-core systems. In High Performance Computer
Architecture (HPCA), 2011 IEEE 17th International Symposium on, pages 169-
180, Feb 2011.
[25] International Technology Roadmap for Semiconductors. 2012 Update Overview,
2012.
[26] H. Garcia-Molina, R.J. Lipton, and J. Valdes. A Massive Memory Machine. IEEE
Trans. Comput., C-33, 1984.
[27] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing memory and traffic requirements for scalable directory-based cache coherence schemes. In International Conference on Parallel Processing, pages 312-321, 1990.
[28] Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki.
Reactive NUCA: near-optimal block placement and replication in distributed
caches. In ISCA, 2009.
[29] Rajeeb Hazra. The explosion of petascale in the race to exascale. International
Supercomputing Conference, 2012.
[30] Rajeeb Hazra. Driving industrial innovation on the path to exascale: From vision
to reality. InternationalSupercomputing Conference, 2013.
[31] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins,
H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella,
P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann,
M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van
Der Wijngaart, and T. Mattson. A 48-Core IA-32 message-passing processor with
DVFS in 45nm CMOS. In Solid-State Circuits Conference, 2010. ISSCC 2010.
Digest of Technical Papers. IEEE International,February 2010.
[32] Wilson C. Hsieh, Paul Wang, and William E. Weihl. Computation migration:
enhancing locality for distributed-memory parallel systems. In PPOPP,1993.
[33] J. Huh, C. Kim, H. Shafi, L. Zhang, D. Burger, and S. W. Keckler. A NUCA
substrate for flexible CMP cache sharing. In ICS, 2005.
[34] Jose A. Joao, M. Aater Suleman, Onur Mutlu, and Yale N. Patt. Bottleneck
identification and scheduling in multithreaded applications. In ASPLOS, 2012.
[35] Rajeev Joshi, Leslie Lamport, John Matthews, Serdar Tasiran, Mark Tuttle, and
Yuan Yu. Checking cache-coherence protocols with tla+. Formal Methods in
System Design, 22:125-131, 2003.
[36] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. In ASPLOS, 2002.
[37] Theodoros Konstantakopulos, Jonathan Eastep, James Psota, and Anant Agarwal. Energy scalability of on-chip interconnection networks in multicore architectures. MIT-CSAIL-TR-2008-066, 2008.
[38] Amit Kumar, Partha Kundu, Arvind Singh, Li-Shiuan Peh, and Niraj K. Jha.
A 4.6Tbits/s 3.6GHz Single-cycle NoC Router with a Novel Switch Allocator in
65nm CMOS. In ICCD, 2008.
[39] Daniel E. Lenoski and Wolf-Dietrich Weber. Scalable Shared-memory Multiprocessing. Morgan Kaufmann, 1995.
[40] Mieszko Lis, Keun Sup Shim, Myong Hyon Cho, Christopher W. Fletcher, Michel
Kinsy, Ilia Lebedev, Omer Khan, and Srinivas Devadas. Brief announcement:
Distributed shared memory based on computation migration. In SPAA, 2011.
[41] Mieszko Lis, Keun Sup Shim, Myong Hyon Cho, Omer Khan, and Srinivas
Devadas. Directoryless shared memory coherence using execution migration. In
PDCS, 2011.
[42] P. Michaud. Exploiting the cache capacity of a single-chip multicore processor with execution migration. In HPCA, 2004.
[43] Jason E. Miller, Harshad Kasture, George Kurian, Charles Gruenwald, Nathan
Beckmann, Christopher Celio, Jonathan Eastep, and Anant Agarwal. Graphite:
A distributed parallel simulator for multicores. In HPCA, 2010.
[44] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8), April 1965.
[45] George Nychis, Chris Fallin, Thomas Moscibroda, and Onur Mutlu. Next generation on-chip networks: what kind of congestion control do we need? In
Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks,
page 12. ACM, 2010.
[46] J.D. Owens, W.J. Dally, R. Ho, D. N. Jayasimha, S.W. Keckler, and Li-Shiuan
Peh. Research challenges for on-chip interconnection networks. Micro, IEEE,
27(5):96-108, Sept 2007.
[47] Michael D. Powell, Arijit Biswas, Shantanu Gupta, and Shubhendu S. Mukherjee.
Architectural core salvaging in a multi-core processor for hard-error tolerance. In
ISCA, 2009.
[48] Adapteva Products. Epiphany-iv 64-core 28nm microprocessor, 2012.
[49] Krishna K. Rangan, Gu-Yeon Wei, and David Brooks. Thread motion: Finegrained power management for multi-core systems. In ISCA, 2009.
[50] Stefan Rusu, Simon Tam, Harry Muljono, Jason Stinson, David Ayers, Jonathan
Chang, Raj Varada, Matt Ratta, and Sailesh Kottapalli. A 45nm 8-core enterprise
Xeon@ processor. In ISSCC, pages 56-57. IEEE, 2009.
[51] D. Sanchez and C. Kozyrakis. SCD: A scalable coherence directory with flexible sharer set encoding. In High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, pages 1-12, Feb 2012.
[52] Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Doug Burger, Stephen W. Keckler, and Charles R. Moore. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In Proceedings of the 30th Annual International Symposium on Computer Architecture, ISCA '03, pages 422-433, New York, NY, USA, 2003. ACM.
[53] Keun Sup Shim, Mieszko Lis, Myong Hyon Cho, Ilia Lebedev, and Srinivas
Devadas. Design Tradeoffs for Simplicity and Efficient Verification in the Execution
Migration Machine. In Proceedings of the Int'l Conference on Computer Design,
October 2013.
[54] Keun Sup Shim, Mieszko Lis, Omer Khan, and Srinivas Devadas. Thread
migration prediction for distributed shared caches. Computer Architecture Letters,
Sep 2012.
[55] Angela C. Sodan. Message-Passing and Shared-Data Programming Models-Wish
vs. Reality. In High Performance Computing Systems and Applications, 2005.
HPCS 2005. 19th InternationalSymposium on, pages 131-139, May 2005.
[56] B. Stackhouse, B. Cherkauer, M. Gowan, P. Gronowski, and C. Lyles. A 65nm 2billion-transistor quad-core itanium processor. In Solid-State Circuits Conference,
2008. ISSCC 2008. Digest of Technical Papers. IEEE International,pages 92-598,
Feb 2008.
[57] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan,
A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar,
and S. Borkar. An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS.
IEEE J. Solid-State Circuits, 43:29-41, 2008.
[58] E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim,
M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring
it all to Software: Raw Machines. In IEEE Computer, pages 86-93, September
1997.
[59] David Wentzlaff, Patrick Griffin, Henry Hoffmann, Liewei Bao, Bruce Edwards,
Carl Ramey, Matthew Mattina, Chyi-Chang Miao, John F. Brown III, and Anant
Agarwal. On-Chip Interconnection Architecture of the Tile Processor. IEEE
Micro, 27:15-31, September 2007.
[60] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2
programs: characterization and methodological considerations. In ISCA, 1995.
[61] D. Yeh, Li-Shiuan Peh, S. Borkar, J. Darringer, A. Agarwal, and Wen-Mei
Hwu. Thousand-core chips [roundtable]. Design Test of Computers, IEEE,
25(3):272-278, May 2008.
[62] M. Zhang and K. Asanović. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. In ISCA, 2005.
[63] Meng Zhang, Alvin R. Lebeck, and Daniel J. Sorin. Fractal coherence: Scalably
verifiable cache coherence. In MICRO, 2010.
Appendix A
Source-level Read-only Data Replication
The directoryless design, which is used for both the remote-access-only baseline and our proposed architecture, does not allow replication of any kind of data at the hardware level. Read-only data, however, can actually be replicated without breaking cache coherence, even without directories and a coherence protocol.
While we believe that the detection and replication of such data can be done fairly straightforwardly by a compiler, automating the replication of these data is out of the scope of this thesis. Instead, we achieve this strictly at the source level;
data can be replicated permanently for globally shared read-only data, and similarly,
read-only data in a limited scope (although globally read-write shared) can also be
easily replicated. For example, several matrix transformation algorithms contain at
their heart the pattern shown in the following pseudocode:
barrier();
for (...) {
    ...
    D1 = D2 + D3;
    ...
}
barrier();
where D1 "belongs" to the running thread but D2 and D3 are owned by other threads
and stored on other cores; this induces a pattern where the thread must perform
remote accesses to load D2 and D3 for every loop. Instead, during time periods when
shared data is read many times by several threads and not written, we can make
temporary local copies of the data and compute using the local copies:
barrier();
// copy D2 and D3 to local L2, L3
for (...) {
    ...
    D1 = L2 + L3;
    ...
}
barrier();
Since these local copies are guaranteed to be only read within the barriers by the
programmer, there is no need to invalidate replicated data afterwards.
With these optimizations, we modified a set of SPLASH-2 benchmarks (FFT, LU,
OCEAN, RADIX, RAYTRACE, and WATER) in order to reduce core miss rate under the
directoryless architecture. Although we only describe our modifications for LU and
WATER here, we have applied the same techniques for the rest of the benchmarks.
LU: In the original version optimized for cache coherence (LU-Contiguous), which we used as a starting point for optimization, the matrix to be operated on is divided into multiple blocks in such a way that all data points in a given block (which are operated on by the same thread) are allocated contiguously. Each block is also
already page-aligned, as shown below:
[Diagram: the global matrix **a is divided into Block 0, Block 1, Block 2, Block 3, ..., each reached through its own pointer (*p0, *p1, *p2, *p3, ...); the blocks are page-aligned.]
Therefore, no data restructuring is required to reduce false sharing.
During each computation phase, however, each thread repeatedly reads blocks
owned by other threads, but writes only its own block; e.g., in the LU source code
snippet
for (k=0; k<dimk; k++) {
    for (j=0; j<dimj; j++) {
        alpha = -b[k+j*strideb];
        for (i=0; i<dimi; i++)
            c[i+j*stridec] += alpha*a[i+k*stridea];
    }
}
since the other threads' blocks (a and b) are mapped to different cores than the current
thread's own block (c), nearly every access triggers a core miss.
Since blocks a and b are read-only data within this function and the contents
are not updated by other threads in the scope, we can apply the method of limited
local replication. In the modified version, a thread copies the necessary blocks (a and b in the example above) to local variables (which are also page-aligned to avoid
false-sharing); the computation then only accesses local copies, eliminating core miss
accesses once the replication is done. We similarly replicate global read-only data such
as the number of threads, matrix size, and the number of blocks per thread.
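A sketch of the modified kernel is shown below; the function name, the local-buffer arguments, and the copy sizes are our own simplifications (the real code also handles block-size edge cases). Blocks a and b, owned by other threads, are first copied into page-aligned thread-local buffers, after which the triple loop touches only local data and incurs no further core misses.

#include <string.h>

void bmod_local(double *c, const double *a, const double *b,
                double *la, double *lb,          /* page-aligned local copies */
                int dimi, int dimj, int dimk,
                int stridea, int strideb, int stridec)
{
    /* one burst of remote reads, then purely local computation */
    memcpy(la, a, (size_t)dimk * stridea * sizeof(double));
    memcpy(lb, b, (size_t)dimj * strideb * sizeof(double));

    for (int k = 0; k < dimk; k++)
        for (int j = 0; j < dimj; j++) {
            double alpha = -lb[k + j*strideb];
            for (int i = 0; i < dimi; i++)
                c[i + j*stridec] += alpha * la[i + k*stridea];
        }
}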
WATER: In the original code, the main data structure (VAR) is a 1D array of
molecules to be simulated, and each thread is assigned a portion of this array to work
on:
[Diagram: VAR is a contiguous 1D array of molecules (MOL 0, MOL 1, MOL 2, MOL 3, ...), with consecutive ranges of the array assigned to Thread 0, Thread 1, ...]
The problem with this data structure is that, as all molecules are allocated contiguously,
molecules processed by different threads can share the same page and this false sharing
can induce unnecessary memory accesses to remote caches.
To address this, we modify the VAR data structure as follows:
[Diagram: **VAR recast as an array of pointers (*p0, *p1, *p2, *p3, ...), each pointing to its own page-aligned molecule (MOL 0, MOL 1, MOL 2, MOL 3, ...).]
By recasting VAR as an array of pointers, we can page-align all of the molecules,
entirely eliminating false-sharing among them; this guarantees that molecules assigned
to a particular thread are mapped to the core where the thread executes.
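The restructured allocation can be sketched in C as follows; the molecule struct and the allocator name are illustrative (the real code keeps WATER's original field layout), and the page size is an assumption.

#include <stdlib.h>

#define PAGE_SIZE 4096                   /* assumed OS page size */

typedef struct { double pos[3], vel[3], force[3]; } MOL;   /* simplified */

/* VAR as an array of pointers: each molecule gets its own page-aligned
 * block, so molecules owned by different threads never share a page. */
MOL **alloc_var(int nmol)
{
    MOL **VAR = malloc((size_t)nmol * sizeof(MOL *));
    if (!VAR)
        return NULL;
    for (int i = 0; i < nmol; i++)
        if (posix_memalign((void **)&VAR[i], PAGE_SIZE, sizeof(MOL)) != 0)
            return NULL;                 /* error handling simplified */
    return VAR;
}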
In addition,
WATER
can also be optimized by locally replicating read-only data. For
each molecule, the thread computes some intermolecular distances to other molecules,
which requires read accesses to the molecules owned by other threads:
CSHIFT()
{
    XL[0] = XMA-XMB;     XL[1] = XMA-XB[0];   XL[2] = XMA-XB[2];
    XL[3] = XA[0]-XMB;   XL[4] = XA[2]-XMB;   XL[5] = XA[0]-XB[0];
    XL[6] = XA[0]-XB[2]; XL[7] = XA[2]-XB[0];
    ...
}

Here, XMB and XB are parts of molecules owned by other threads, while XMA, XA, and XL belong to the thread that calls this function. Since all threads are synchronized before and after this step, and the other threads' molecules are not updated, we can safely make a read-only copy in the local memory of the caller thread. Thus, after initially copying XMB and XB to thread-local data, the remainder of the computation induces no further core misses.

Table A.1 shows that the total number of modified or added lines of code for each benchmark due to this source-level replication is small.¹ These modified benchmarks allow us to extrapolate the benefits that can be obtained by replicating data that need no coherence on the directoryless architecture, and also to compare the performance of the remote-access-only baseline and our hybrid scheme on top of replication support. It is important to note that both directoryless architectures benefit from this replication.

              Number of total    Number of
              code lines         changed code lines
FFT           701                21
LU            732                38
OCEAN         3817               30
RADIX         662                27
RAYTRACE      5461               46
WATER-NSQ     1192               98

Table A.1: The total number of changed code lines

¹ Our count excludes comments and blank lines in the code. Our modifications were strictly source-level, and did not alter the algorithm used.