CAMPS: Conflict-Aware Memory-Side Prefetching Scheme for
Hybrid Memory Cube
Muhammad M. Rafique
Zhichun Zhu
Department of Electrical and Computer Engineering
University of Illinois at Chicago
Chicago, Illinois, USA
mrafiq2@uic.edu
Department of Electrical and Computer Engineering
University of Illinois at Chicago
Chicago, Illinois, USA
zzhu@uic.edu
ABSTRACT
Prefetching is a well-studied technique where data is fetched from
main memory ahead of time speculatively and stored in caches or
dedicated prefetch buffer. With the introduction of Hybrid
Memory Cube (HMC), a 3-D memory module with multiple
memory layers stacked over a single logic layer using thousands
of Through Silicon Vias (TSVs), huge internal bandwidth
availability makes memory-side prefetching a more efficient
approach to improving system performance. In this paper, we
introduce a memory-side prefetching scheme for HMC based
main memory system that utilizes its logic area and exploits the
huge internal bandwidth provided by TSVs. Our scheme closely
monitors the access patterns to memory banks and makes
intelligent prefetch decisions for rows with high utilization or
rows causing row buffer conflicts. We also introduce a prefetch buffer
management scheme that makes replacement decisions within the
prefetch buffer based on both the utilization and recency of the
prefetched rows. Our simulation results indicate that our
approach improves performance by 17.9% on average, compared
to a baseline scheme that prefetches a whole row on every
memory request. Our scheme also outperforms, by 8.7% on
average, an existing memory-side prefetching scheme that
dynamically adjusts the prefetch degree based on the usefulness
of prefetched data.
CCS CONCEPTS
• Computer systems organization → Processors and memory
architectures; • Hardware → Semiconductor memory.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full
citation on the first page. Copyrights for components of this work owned by others
than ACM must be honored. Abstracting with credit is permitted. To copy otherwise,
or republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Request permissions from Permissions@acm.org.
ICPP 2018, August 13–16, 2018, Eugene, OR, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6510-9/18/08…$15.00
https://doi.org/10.1145/3225058.3225112
KEYWORDS
Hybrid Memory Cube (HMC), Prefetch Buffer, Memory-Side
Prefetching, Memory Bandwidth.
ACM Reference format:
Muhammad Rafique, Zhichun Zhu. 2018. CAMPS: Conflict-Aware
Memory-Side Prefetching Scheme for Hybrid Memory Cube. In ICPP 2018:
47th International Conference on Parallel Processing, August 13–16, 2018,
Eugene, OR, USA. ACM, New York, NY, USA, 9 pages.
https://doi.org/10.1145/3225058.3225112
1. INTRODUCTION
The increasing speed and bandwidth gap between processor and
memory, commonly referred to as “memory wall” problem [1], is
a major performance bottleneck in today’s computer systems.
This problem gets even worse as we enter the era of many-core
systems and big-data applications. The traditional DDRx-based
main memory system fails to cope with the increasing bandwidth
and speed requirements of the processor. The off-chip data
access latency, limited bandwidth, and random request streams in
multi-core environments are the major factors driving researchers to
find alternative approaches to main memory organization. One of
the ideas is to reduce the off-chip data access latency by bringing
processing closer to data, known as Processing-In-Memory (PIM)
[7], [14], [18]. Although this approach is promising for hiding data
access latency, the challenge of integrating logic and memory has
been a major obstacle to turning this idea into a successful
commercial product. Recently, the idea has been revived with
Micron's introduction of the Hybrid Memory Cube (HMC) [2], a 3-D
structure where multiple memory layers are stacked over a single
logic layer and connected by Through Silicon Vias (TSVs). This
3-D stacking of memory over a logic area mitigates the
bandwidth gap between the processor and main memory to a large
extent and reduces the total energy consumption per useful unit
of work done [2].
The 3-D stacking of DRAM on top of a logic layer not only
opens up new possibilities in the way main memory requests are
managed, but also creates room for improving existing
techniques used to tackle the memory wall problem. One of the
techniques used to hide data access latency is “Prefetching”,
where data is brought in speculatively ahead of actual need and
is stored near the processing element with minimum access
latency. While prefetching tends to improve system performance,
it comes with overheads of increased memory bandwidth
requirement and energy consumption. Prefetch transactions also
compete with demand accesses to the main memory for precious
resources like memory bandwidth and cache space. Both core-
side prefetching [18, 23, 26] and memory-side prefetching [8, 19, 25]
have been studied extensively.
There are several design considerations to perform
prefetching such as when and which data should be prefetched
and where the prefetched data should be stored. With the 3-D
stacking of DRAM over a logic layer, the optimal design choices for
memory-side prefetching may differ from those of existing schemes
targeting conventional DDRx-based main memory systems, because
HMC provides a substantial improvement in bandwidth and offers
logic area near memory.
In this paper, we propose CAMPS, a Conflict-Aware Memory-Side Prefetching Scheme, which stores prefetched data in a
dedicated buffer located in the logic base of HMC. The data is
moved between memory bank and prefetch buffer at the
granularity of row buffer size, taking full advantage of thousands
of TSVs, which are not available in DDRx-based main memory
systems. The main idea is to closely monitor the access patterns
to different rows within a bank and prefetch a whole row to the
prefetch buffer based on two criteria. First, it prefetches those
rows that tend to cause more row buffer conflicts. Second, it
prefetches those rows that have higher utilization, i.e., with
multiple blocks within the row being accessed. We also propose
a replacement policy designed specifically for the prefetch buffer.
When a replacement is needed, the decision is made dynamically
based on both the utilization and recency of the prefetched rows.
Our scheme implements prefetching on the memory side, and
prefetched data is not aggressively pushed to caches. This avoids
cache pollution and saves energy that otherwise would be
consumed by moving unutilized data between caches and main
memory.
The following are the major contributions of this work:
• We explore memory-side prefetching in the Hybrid Memory Cube by extensively utilizing its huge internal bandwidth and its logic base.
• We introduce a prefetch decision mechanism that prefetches rows based on their utilization and their likelihood of causing row buffer conflicts.
• We propose a prefetch buffer management policy that makes replacement decisions based on the utilization of prefetched data along with the recency of rows in the prefetch buffer.
Simulation results indicate that our scheme can effectively
improve the system performance. Compared to a baseline scheme
that prefetches a whole row on every memory request, our
approach improves performance by 17.9% on average for
eight-core multiprogramming SPEC CPU2006 workloads. When
compared to an existing memory-side prefetching scheme that
dynamically adjusts the prefetch degree based on the usefulness
of prefetched data and uses traditional LRU replacement policy
for prefetch buffer, our scheme can improve performance by 8.7%
on average.
The rest of the paper is organized as follows: Section 2 briefly
describes the HMC structure, the basics of prefetching, and the
motivation for our work. Section 3 describes our proposed
prefetching scheme in detail. Section 4 describes our experimental
methodology. Analyses of the effectiveness of our approach are
presented in Section 5. Section 6 discusses other related work and
Section 7 concludes the paper.
2. BACKGROUND AND MOTIVATION
In this section, we briefly review the HMC architecture, prefetching,
and the design considerations for an effective prefetching mechanism.
We also analyze how memory-side prefetching can be adapted to
new memory technologies and show why the HMC architecture is
well suited for memory-side prefetching.
2.1 HMC Architecture
A hybrid memory cube (HMC) consists of multiple memory layers
stacked over a single logic layer as shown in Figure 1.
Figure 1: HMC Structure
Each memory layer along with the logic layer is further
partitioned into slices. A single slice of logic layer and multiple
layers of memory partitions are connected with multiple Through
Silicon Vias (TSVs), forming a column-like structure called vault.
Each vault is functionally independent and has its own vault
controller residing in its logic base. The vault controller manages
the low-level DRAM operations such as address mapping,
refreshing, and memory access scheduling, taking this burden off
the main memory controller residing on the processor chip. A single
memory layer partition is further divided into multiple banks that
can be accessed concurrently. A single vault can have as many as
16 banks (two banks per vault-layer with 8 DRAM layers), which
makes up to 512 banks available in a single HMC package with
32-vault configuration [2].
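The bank counts above follow directly from the stated organization; as a quick illustrative check (values taken from the description above, not from any vendor datasheet we introduce here), the arithmetic is:

```cpp
#include <cstdio>

int main() {
    constexpr int banks_per_vault_layer = 2;   // two banks per vault per DRAM layer
    constexpr int dram_layers           = 8;
    constexpr int vaults                = 32;

    constexpr int banks_per_vault = banks_per_vault_layer * dram_layers;  // 16
    constexpr int total_banks     = banks_per_vault * vaults;             // 512

    static_assert(banks_per_vault == 16, "16 banks per vault");
    static_assert(total_banks == 512, "512 banks per HMC package");
    std::printf("banks per vault = %d, total banks = %d\n", banks_per_vault, total_banks);
    return 0;
}
```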
This organization of banks has huge potential for exploiting
bank-level parallelism, as this feature is critical for random
requests coming from today’s multi-core and many-core
processors. The processor is connected to HMC through very
high-speed serial I/O links operating at as much as 30 Gb/s. All
the serial links are connected to the vault controllers through a
crossbar switch that routes the request packet coming from the
processor to a particular vault controller. These serial links
operate at very high frequency and, along with the thousands of TSVs
connecting the memory layers to the vault controllers,
collectively provide an enormous amount of bandwidth reaching
up to 320 GB/s. Figure 2 shows the architecture of an HMC-based main
memory system equipped with a memory-side prefetcher.
2.2 Prefetching
Prefetching is a well-known technique used to speed up program
execution by fetching data speculatively ahead of its actual need.
The idea is to bring additional cache blocks from main memory
that might be useful in future along with the requested blocks.
Although prefetching tends to improve performance, it may also
degrade performance if not done carefully. There are some design
considerations for prefetching which we will discuss shortly.
Prefetching can be implemented in either software or
hardware. In software-based prefetching, the prefetch
instructions can be inserted by the compiler [22] or by the
programmer. This type of prefetching gives programmer control
over program execution but also increases the code footprint. In
hardware-based prefetching, memory access patterns are
monitored by specialized hardware and data is prefetched to
either caches or a dedicated prefetch buffer, based on a
prediction mechanism that takes advantage of previous access
patterns and bandwidth availability. Our scheme deploys an effective
hardware-based memory-side prefetching, utilizing the huge
internal bandwidth and logic area available near data storage.
2.3 Prefetching Design Considerations
An effective prefetching scheme improves system performance by reducing the average memory access time (AMAT), since prefetched data is available in a storage space that can be accessed with shorter latency. Prefetching, however, comes with its own overheads, and certain issues need to be addressed to make sure the cost of implementing prefetching does not exceed the benefits gained from it. First, what to prefetch is the most important design consideration. Special attention should be paid to selecting candidate data for prefetching, as unnecessary prefetched data causes energy overhead and additional delays for the data that is actually requested by the processor from main memory. Second, when to prefetch is another factor to be taken into consideration. Prefetching should be done just in time to gain maximum benefit: prefetching too early may cause the prefetched data to be evicted even before its potential use, while prefetching too late essentially defeats the purpose, as it no longer hides the data access latency. Finally, the question of where to store the prefetched data should be answered. Prefetched data can be brought into the caches directly along with the demanded data, or it can be placed in a separate prefetch buffer. Storing it in the caches reduces the effective cache capacity for actual requested data, while a separate prefetch buffer requires additional space.
In summary, an efficient prefetching mechanism correctly predicts and fetches data in a timely fashion and stores it where it can be accessed with minimum overhead.
2.4 Motivation
Keeping key design considerations in mind and analyzing the
advantages of HMC architecture explained earlier, we now
explain how it can be best exploited to design an efficient
memory-side prefetching mechanism. As mentioned earlier, the
placement of prefetched data is a key design choice when
implementing efficient hardware prefetching. Several previous
proposals suggest that the prefetched data should be brought into
caches [15, 24], so that it can be readily available to the processor
core with minimum latency. In that case, the effective cache capacity
for demand requests is reduced and prefetched data might be
evicted without ever being used.
The alternative approach is to prefetch data into a dedicated
prefetch buffer residing in the memory controller and push it to the
caches when requested [8, 10, 19, 26]. Memory-side prefetching has
gained attention because it has more information about the actual
state of memory-related parameters such as channel availability and
the state of the row buffer; a prefetch can be initiated only when a
channel is available to carry out the prefetch transaction.
Core-side prefetching, on the other hand, relies
on prediction algorithms and has little knowledge of memory
bank state. Aggressive prefetching could be a waste of resources
such as memory bandwidth and energy, and it might delay the
demand requests due to additional network traffic congestion
[24].
Figure 2. HMC Architecture equipped with Memory-Side
Prefetcher
With the introduction of HMC and the availability of more logic
real estate to control memory operations, memory-side prefetching
becomes a more pragmatic approach for the following reasons:
• The prefetch buffer can easily be placed within the vault controller. As vaults are functionally independent of each other, each vault controller can accommodate one buffer that caters to the requests destined for it.
• Due to the huge internal bandwidth made possible by thousands of TSVs [32], a whole row of data can be aggressively prefetched to the buffer without utilizing the external links connected to the processor, thus reducing congestion.
• As prefetching implemented within vaults does not utilize external resources like I/O links, under-utilization of prefetched data would not hurt as much as it would in the case of core-side prefetching, or of memory-side prefetching in traditional DRAM-based systems where the prefetch buffer resides in the memory controller on the processor chip.
• The prefetched data is not proactively pushed towards the upper-level caches, thus avoiding cache pollution. It is pushed only when requested, thereby turning “pre” fetching into “demand” fetching.
• Since the I/O serial links operate at a much higher frequency than regular memory buses [2], accessing off-chip prefetched data incurs shorter latency.
3. PROPOSED PREFETCHING SCHEME
Considering the potential advantages of implementing memory-side prefetching on the HMC platform described earlier, we propose
a novel memory-side prefetching scheme that prefetches data by
extensively utilizing huge internal bandwidth available in HMC
vaults while not aggressively pushing it to the caches, except
when requested. The memory-side prefetching scheme is
implemented entirely in the vault controller residing in the logic
base of HMC. Our scheme closely monitors the state of memory
banks and prefetches data cautiously based on utilization of rows
currently opened in the row buffer. It also profiles recently
opened rows and prefetches the rows that tend to cause row
buffer conflicts. Our scheme further considers the utilization and
recency of each row residing in the prefetch buffer and makes
intelligent replacement decisions.
3.1 Conflict-Aware Prefetching
Row buffer conflicts within memory banks are a major cause of delays in accessing main memory. In conflict-aware prefetching, our scheme takes this factor into account by profiling and prefetching rows that tend to cause the most row buffer conflicts. In this way, requests to these rows will be served from the prefetch buffer, thus reducing row buffer conflicts. To implement this scheme, we use two different tables for profiling the behavior of rows. The first one is the Row Utilization Table (RUT), with one entry for each of the 16 banks in a vault, which keeps track of how many distinct cache lines have been accessed from the row that is currently in the row buffer. The second is the Conflict Table (CT), with 32 entries per vault, which keeps track of rows that have been replaced from the row buffers recently. The conflict table has a fully associative structure, and its entries are shared by all the banks in a vault.
Once a request from the processor is sent to the memory, the vault controller first checks the prefetch buffer to see whether it has the requested data. If there is a miss in the prefetch buffer, the controller checks whether the request is a row-buffer hit. Upon a row-buffer hit, it serves the request from the row buffer and stores the utilization of the row in the RUT if the information is not there yet. Each entry in the RUT is a utilization counter, which is incremented every time a request is served from the corresponding row. If the number of accesses to a row exceeds a threshold value (four in our experiments), our scheme fetches the whole row to the prefetch buffer and precharges the bank to make it ready for the next request. An access to a different row of the same bank replaces the previous row utilization entry of that bank in the RUT, and the replaced entry is moved to the CT.
The row utilization information kept in the CT is used later to determine whether a row causes row buffer conflicts. When there is a row buffer miss, our scheme precharges the bank, activates the required row, and serves the request as normal. In addition, if the newly opened row already has an entry in the CT, the row has been accessed recently, which indicates that it caused a row-buffer conflict and is a good candidate for prefetching. After fetching this row to the prefetch buffer, its entry is removed from the CT and the bank is precharged. If the CT is full, our scheme uses an LRU policy to evict the information for the least recently accessed row. If the newly opened row has no entry in the CT, our scheme keeps that row open and inserts its information into the RUT as explained above. Figure 3 summarizes our scheme.
As we will show later in the results section, this approach considerably reduces row buffer conflicts because both the frequently accessed rows with high utilization and the rows causing more row-buffer conflicts are prefetched to the prefetch buffer. As our scheme is based on aggressive prefetching [15], we next introduce another optimization to manage the prefetch buffer effectively and minimize the overheads of aggressive prefetching.
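To make the decision flow concrete, the following sketch models one vault's RUT and CT logic in software. The structure and function names, and the exact handling of the access threshold, are our illustrative choices under the table sizes and threshold stated above, not a definitive hardware implementation.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

constexpr int kBanksPerVault   = 16;   // 16 banks tracked per vault (RUT size)
constexpr int kConflictEntries = 32;   // fully associative CT, 32 entries per vault
constexpr int kUtilThreshold   = 4;    // accesses before a hot open row is prefetched

struct RowUtilEntry { int64_t row = -1; int util = 0; };   // one RUT entry per bank

struct ConflictTable {                                     // rows recently closed, LRU-ordered
    std::list<int64_t> lru;                                // front = most recently inserted
    std::unordered_map<int64_t, std::list<int64_t>::iterator> pos;
    bool contains(int64_t row) const { return pos.count(row) != 0; }
    void erase(int64_t row) { lru.erase(pos[row]); pos.erase(row); }
    void insert(int64_t row) {
        if (contains(row)) erase(row);
        if (static_cast<int>(lru.size()) == kConflictEntries) {   // evict LRU victim
            pos.erase(lru.back());
            lru.pop_back();
        }
        lru.push_front(row);
        pos[row] = lru.begin();
    }
};

struct Vault {
    std::vector<RowUtilEntry> rut = std::vector<RowUtilEntry>(kBanksPerVault);
    std::vector<int64_t> open_row = std::vector<int64_t>(kBanksPerVault, -1);
    ConflictTable ct;

    // Returns true if this access triggers a whole-row prefetch into the buffer.
    // (The prefetch-buffer lookup happens before this point and is omitted.)
    bool handle_request(int bank, int64_t row) {
        if (open_row[bank] == row) {                 // row-buffer hit
            if (++rut[bank].util >= kUtilThreshold) {    // hot row: prefetch, then precharge
                open_row[bank] = -1;
                rut[bank] = RowUtilEntry{};
                return true;
            }
            return false;
        }
        // Row-buffer miss: retire the previous row's utilization info into the CT.
        if (rut[bank].row >= 0) ct.insert(rut[bank].row);
        if (ct.contains(row)) {                      // recently replaced => conflict-prone
            ct.erase(row);
            open_row[bank] = -1;                     // prefetch the row, then precharge
            rut[bank] = RowUtilEntry{};
            return true;
        }
        open_row[bank] = row;                        // keep the row open, start tracking it
        rut[bank] = RowUtilEntry{row, 1};
        return false;
    }
};
```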
Figure 3. CAMPS Scheme
3.2 Utilization and Recency Based Prefetch
Buffer Management
Existing memory-side prefetching schemes choose the least recently
accessed entry in the prefetch buffer as the replacement candidate
[8, 13]. To better utilize the limited number of entries in the
prefetch buffer, our scheme considers the utilization and recency
of each row that has been brought to the prefetch buffer. The utilization
of a row is defined as the number of distinct cache lines referenced
within that row. The recency of a row is defined by how recently the
row has been referenced within the prefetch buffer.
Our scheme uses a utilization counter for each prefetched
row, which is incremented every time a distinct cache line is
accessed within the row. To keep track of recency, we use a
separate recency counter for each prefetched row. When a row is
accessed, its recency counter is set to 15 (the maximum value, given
the 16 entries in the prefetch buffer), while the recency counters of the
other rows whose values were larger than the accessed row's previous
value are decremented. The least recently accessed row has a recency counter
value of 0. To make a replacement decision, our scheme will first
check whether there is any row with all distinct cache lines
accessed. If so, that row does not need to stay in the prefetch
buffer since all its data have been transferred to the processor. If
no such row exists, our scheme adds up the values of utilization
counter and recency counter and replaces the row with the
minimum sum. If two or more rows have the same minimum sum,
the row with the lowest utilization count value will be evicted.
Figure 4 summarizes this scheme. This approach gives highly
utilized and frequently accessed rows longer life in prefetch
buffer. As the results will show, it is a more efficient replacement
scheme as compared to LRU.
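The victim-selection logic can be summarized by the following sketch; the helper names are ours, and the 16 lines per row follow from the 1KB row buffer and 64B cache lines in Table I. It is an illustrative model, not the exact hardware.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBufferEntries = 16;   // 16KB per-vault buffer with 1KB rows
constexpr int kLinesPerRow   = 16;   // 1KB row / 64B cache line

struct BufferedRow {
    int64_t row_id  = -1;
    int utilization = 0;             // distinct cache lines referenced so far
    int recency     = 0;             // kBufferEntries-1 = MRU, 0 = LRU
};

// LRU-stack style recency update when the entry at index `hit` is accessed.
void touch(std::vector<BufferedRow>& buf, std::size_t hit) {
    int old = buf[hit].recency;
    for (auto& e : buf)
        if (e.recency > old) --e.recency;         // rows above the hit slide down
    buf[hit].recency = kBufferEntries - 1;        // accessed row becomes MRU
}

// Choose the entry to evict when a new row must be inserted.
std::size_t pick_victim(const std::vector<BufferedRow>& buf) {
    // A fully consumed row (all distinct lines already sent to the processor)
    // no longer needs to stay in the buffer.
    for (std::size_t i = 0; i < buf.size(); ++i)
        if (buf[i].utilization >= kLinesPerRow) return i;

    // Otherwise evict the minimum (utilization + recency) sum; ties go to the
    // row with the lower utilization count.
    std::size_t victim = 0;
    for (std::size_t i = 1; i < buf.size(); ++i) {
        int si = buf[i].utilization + buf[i].recency;
        int sv = buf[victim].utilization + buf[victim].recency;
        if (si < sv || (si == sv && buf[i].utilization < buf[victim].utilization))
            victim = i;
    }
    return victim;
}
```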
Figure 4. Replacement policy for prefetch buffer
3.3 Overhead
Our proposed prefetching scheme introduces some area
overheads. The row utilization table (RUT) consists of 16 entries
per vault controller, with each entry being 20 bits, making the
hardware overhead of the RUT 40 bytes per vault controller. With 32
vault controllers, this gives a total overhead of 1.25KB. Similarly, we
also have one conflict table (CT) per vault controller, with 32
entries of 20 bits each; the hardware overhead of the CT is 2.5KB
in total for the 32 vault controllers.
Thus, our scheme introduces a total overhead of 3.75KB over the
baseline prefetcher architecture, which is trivial compared with
the prefetch buffer itself. Note that our scheme has the same
prefetch buffer size as other memory-side prefetching schemes
used in the experiments.
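The overhead figures can be re-derived with a few lines of arithmetic (an illustrative check of the numbers quoted above, not part of the design itself):

```cpp
#include <cstdio>

int main() {
    constexpr int bits_per_entry = 20;
    constexpr int vaults         = 32;

    // 16 RUT entries and 32 CT entries per vault controller.
    constexpr double rut_kb = 16 * bits_per_entry / 8.0 * vaults / 1024.0;  // 1.25 KB
    constexpr double ct_kb  = 32 * bits_per_entry / 8.0 * vaults / 1024.0;  // 2.50 KB

    std::printf("RUT = %.2f KB, CT = %.2f KB, total = %.2f KB\n",
                rut_kb, ct_kb, rut_kb + ct_kb);
    return 0;
}
```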
4. EXPERIMENTAL METHODOLOGY
We evaluate our enhanced memory-side prefetching scheme using the cycle-accurate x86 full-system simulator gem5 [4]. The Hybrid Memory Cube (HMC) is modeled based on [2, 11, 13, 21]. This simplified yet accurate HMC model is composed of vault controllers, serial links, the main internal crossbar, and an external HMC controller. The detailed DRAM operations are based on the models presented in [3] and [16]. The details of the simulated system are shown in Table I.
Table I: Experimental Configuration
Processor: 8 cores @ 3GHz, issue width = 4, x86 ISA, out-of-order
Caches: L1 (I/D): 32KB private, 2-way, hit latency = 2 cycles; L2: 256KB private, 4-way, hit latency = 6 cycles; L3: 16MB shared, 16-way, hit latency = 20 cycles, 64B line
HMC: 8 DRAM layers, 32 vaults, 2 banks per vault layer, 1KB row-buffer size
DRAM: DDR3-1600, queue size (R/W) = 32, tRCD = 11 cycles, tRP = 11 cycles, tCL = 11 cycles
Serial Links: 4 serial links with 16 input and 16 output lanes for full-duplex configuration, 12.5 Gbps link
PF Buffer: 16KB per vault, fully associative, 1KB line, hit latency = 22 cycles
Vault Controller: address mapping RoRaBaVaCo (row-rank-bank-vault-column); memory scheduling FR-FCFS [31] (first-ready, first-come-first-serve); page policy: open page
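As an illustration of the RoRaBaVaCo mapping listed above, the following sketch decodes a physical address into its DRAM coordinates. The bit widths are assumptions consistent with the 1KB row buffer, 32 vaults, and 16 banks per vault; they are not values specified by the HMC standard or the paper.

```cpp
#include <cstdint>

struct DramAddr { uint64_t row, rank, bank, vault, column; };

constexpr unsigned kColBits   = 10;  // 1KB row buffer: byte offset within a row (assumed)
constexpr unsigned kVaultBits = 5;   // 32 vaults
constexpr unsigned kBankBits  = 4;   // 16 banks per vault
constexpr unsigned kRankBits  = 1;   // assumed

// Decode from low to high bits: column, vault, bank, rank, then row on top.
DramAddr decode(uint64_t paddr) {
    DramAddr a{};
    a.column = paddr & ((1u << kColBits) - 1);   paddr >>= kColBits;
    a.vault  = paddr & ((1u << kVaultBits) - 1); paddr >>= kVaultBits;
    a.bank   = paddr & ((1u << kBankBits) - 1);  paddr >>= kBankBits;
    a.rank   = paddr & ((1u << kRankBits) - 1);  paddr >>= kRankBits;
    a.row    = paddr;                            // remaining high-order bits
    return a;
}
```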
4.1 Workloads
We use SPEC CPU2006 [6, 20] benchmarks to form eight-core multiprogramming workloads. A total of 12 different sets of benchmarks have been chosen and categorized based on misses per kilo instructions (MPKI) [13]. Applications with MPKI ≥ 20 are categorized as high memory intensive (HM), while applications with 1 ≤ MPKI < 20 are categorized as low memory intensive (LM). We then choose four applications from the HM set and four from the LM set to make a third working set of mixed (MX) benchmarks. Table II gives the details of these eight-core multiprogramming workloads.
Table II: SPEC CPU2006 Benchmark Sets
HM1: bwaves, gems, gcc, lbm, bwaves, gcc, lbm, gems
HM2: milc, gems, sphinx, omnetpp, sphinx, milc, omnetpp, gems
HM3: gcc, mcf, lbm, milc, mcf, gcc, milc, lbm
HM4: sphinx, gcc, lbm, bwaves, sphinx, bwaves, lbm, gcc
LM1: cactus, bzip2, astar, wrf, wrf, bzip2, cactus, astar
LM2: tonto, zeusmp, h264ref, astar, zeusmp, h264ref, astar, tonto
LM3: bzip2, zeusmp, cactus, tonto, cactus, zeusmp, bzip2, tonto
LM4: astar, tonto, bzip2, h264ref, tonto, astar, bzip2, h264ref
MX1: bwaves, gcc, cactus, wrf, cactus, gcc, wrf, bwaves
MX2: gems, sphinx, tonto, h264ref, sphinx, gems, h264ref, tonto
MX3: milc, lbm, wrf, bzip2, lbm, bzip2, milc, wrf
MX4: gcc, bwaves, bzip2, astar, bwaves, gcc, bzip2, astar
Figure 5. Normalized performance gains of CAMPS compared with different prefetching schemes (normalized speedup; higher is better)
The experiments first fast-forward the simulations for two
billion instructions from the beginning and warm up the caches
for 100 million instructions. Then detailed simulations are run for
800 million instructions.
5. RESULTS AND ANALYSIS
In this section, we present the results of our proposed scheme
and compare it to multiple memory-side prefetching schemes.
The first one is the baseline scheme, which prefetches a whole
row at the first access to the row (called BASE hereafter). The
second scheme prefetches a whole row if the row has two or more
hits based on the requests in the read queue. This scheme is
referred to as BASE-HIT. We also compare our scheme with an
existing memory-side prefetching scheme [8], which dynamically
adjusts the prefetch degree based on the usefulness of prefetched
data and uses traditional LRU policy for prefetch buffer
management (called MMD hereafter). Our conflict-aware
prefetching scheme will be referred to as CAMPS, and our
utilization and recency based prefetch buffer management
scheme, applied on top of CAMPS, is referred to as CAMPS-MOD.
5.1 Performance Improvement
Figure 5 shows the overall performance gains obtained by our
schemes compared with different memory-side prefetching
schemes mentioned above. The performance gains are calculated
as the geometric mean of the IPC values of different workloads
running on the eight processor cores. The performance of
baseline is normalized to 1. The average gain for the high memory
intensive workloads (HM) is 24.9% over the BASE, 16.3% over
BASE-HIT, and 21.8% over the MMD, respectively. The
performance gain for the low memory-intensive benchmarks (LM)
is 9.4% over BASE, 5.1% over BASE-HIT, and 4.9% over the MMD,
respectively. Finally, for mixed benchmarks (MX), our scheme
improves performance by 19.6%, 16.1% and 8.7% over BASE,
BASE-HIT, and MMD, respectively. On average, our scheme
CAMPS-MOD outperforms BASE by 17.9%, BASE-HIT by 16.8%,
and MMD by 8.7%, respectively, across a wide range of
multiprogramming workloads.
We can see from the results that the high memory intensive
workloads get the most performance gains compared to their less
memory intensive counterparts while gains for the mixed
memory intensive workloads fall in between the two. This is
because high memory intensive workloads need to access off-chip
data more often. As our scheme selectively prefetches data
into the prefetch buffer to increase prefetch buffer utilization and
reduce row-buffer conflicts, it is more likely that requested
data can be found in the prefetch buffer, which translates into lower
main-memory access latency.
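For clarity, the normalization used in Figure 5 can be sketched as follows; the function names and the exact averaging granularity are our assumptions rather than details given in the paper.

```cpp
#include <cmath>
#include <vector>

// Geometric mean of per-core IPC values for one workload under one scheme.
double geomean(const std::vector<double>& ipc) {
    double log_sum = 0.0;
    for (double v : ipc) log_sum += std::log(v);
    return std::exp(log_sum / ipc.size());
}

// Normalized speedup of a scheme relative to the BASE prefetcher
// (BASE therefore normalizes to 1.0 by construction).
double normalized_speedup(const std::vector<double>& ipc_scheme,
                          const std::vector<double>& ipc_base) {
    return geomean(ipc_scheme) / geomean(ipc_base);
}
```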
5.2 Row-Buffer Conflict Reduction
Figure 6 shows the row-buffer conflict reduction for various
prefetching schemes. The BASE scheme is not included here
because the whole row is prefetched every time a row is opened
in the row-buffer and precharged after being copied to prefetch
buffer, so there are no row-buffer conflicts. Of course, compared
with other schemes, the BASE scheme will more likely fetch
unnecessary rows to the prefetch buffer. As our scheme takes
potential row-buffer conflicts into account while making prefetch
decision, it considerably reduces row-buffer conflicts. As shown
in the figure, our scheme reduces row-buffer conflicts on average
by 16.3% across all banks within vaults, compared to BASE-HIT,
and by 13.6% compared to MMD. Other schemes do not take into
account the conflicts within banks while making prefetch
decisions. This reduction in row-buffer conflicts leads to reduced
average memory access time (AMAT), which results in overall
speedup of program execution.
Figure 6. Row Buffer Conflict Reduction (percentage of row-buffer conflicts under different schemes; lower is better)

5.3 Prefetching Accuracy
Figure 7 shows the prefetching accuracy of our proposed scheme as compared to the others. Prefetching accuracy is defined as the fraction of all prefetched data that is actually referenced by the processor. A higher accuracy is desired for a prefetching scheme to be efficient. Our scheme CAMPS-MOD has an accuracy of 70.5% on average and outperforms BASE, BASE-HIT, and MMD by 33.3%, 28.4%, and 4.1%, respectively. Note that CAMPS without the optimized prefetch buffer management has an accuracy slightly lower than MMD (by 1.5%), which led us to introduce the recency and utilization based prefetch-buffer management that evicts rows with lower utilization that have not been accessed recently.

Figure 7. Prefetching Accuracy of Different Schemes (higher is better)

5.4 Reduction in Memory Access Latency
Figure 8 shows the reduction in average memory access time (AMAT) of our scheme and MMD compared to BASE. Our proposed scheme sorts out the rows that tend to cause more row buffer conflicts and prefetches them to the buffer based on repeated accesses within a given window of time. This makes accesses to main memory less contentious. As a result, there is a reduction in average memory access time. Results show that our scheme reduces AMAT by 26% compared to BASE and 16.3% compared to MMD on average.

Figure 8. Reduction in Average Memory Access Time (higher is better)

5.5 HMC Energy Consumption
Figure 9 shows the average energy consumption of HMC across all vaults for different prefetching schemes, normalized to the BASE scheme. As the BASE scheme fetches a whole row to the prefetch buffer and has the lowest overall prefetching accuracy, it leads to more frequent replacements of rows from the prefetch buffer back to the memory banks, making it the scheme that consumes the most energy. As can be seen in the figure, the MMD and CAMPS-MOD schemes consume less energy than the BASE scheme by 6.0% and 8.5%, respectively. The energy savings are mainly due to fewer activation and precharge operations.

Figure 9. Average HMC Energy Consumption normalized to BASE (lower is better)
6. RELATED WORK
6.1 Prefetching
Prefetching has been a widely studied technique to improve
system performance. Several prior studies have proposed
core-side prefetching [18, 23, 26], where prefetch requests are
initiated by an engine in the cache hierarchy, and memory-side
prefetching [8, 10, 25], where the prefetching mechanism is
implemented in the memory controller.
Hur and Lin [10] proposed an adaptive stream detection
prefetching policy that dynamically adjusts the aggressiveness of
the prefetcher based on the spatial locality in the running application.
The prefetcher detects a stream of n successively accessed cache lines,
issues a prefetch for the (n+1)th, and continues prefetching until the
prefetched cache lines are no longer utilized. Our scheme, on the
other hand, works at the granularity of the row-buffer size,
prefetches a whole row into the prefetch buffer, and selectively pushes
data to the caches based on requests. This way, our scheme avoids
pushing useless data blocks to the caches, hence avoiding cache
pollution.
Panda et al. [25] proposed a combination of three different
lightweight memory-side prefetchers aiming to improve the
performance of GPU-PIM systems. Their work mostly targets
GPU-based real-world applications that exhibit more
locality, while our scheme is suitable for any general-purpose
workload. Jog et al. [30] proposed a memory-side prefetcher
which prefetches cachelines that have not been touched within a
DRAM row currently opened in the row-buffer and aims to
improve row buffer locality.
6.2 Memory-Side Prefetching in HMC
With HMC being a relatively new technology, very few studies have concentrated on implementing memory-side prefetching with an HMC-based main memory system. Other related memory-side prefetching schemes assume a prefetch buffer located inside the memory controller on the processor chip.
Ahn et al. [13] studied the energy efficiency of an HMC-based main memory system in the presence of a conservative core-side prefetcher and an aggressive memory-side prefetcher with varying prefetch degree and distance, using an LRU policy for prefetch buffer management. In contrast, our work implements an efficient memory-side prefetcher that monitors access patterns to memory banks and prefetches the rows that cause the most row buffer conflicts along with the rows that have more hits within the row buffer. The prefetch buffer is managed by taking into account the utilization and recency of data within the buffer.
7. CONCLUSION
In this paper, we propose CAMPS, a conflict-aware memory-side prefetching scheme for HMC-based main memory systems, which makes intelligent prefetching decisions by keeping track of the utilization of rows currently opened in the row buffers and the probability of rows causing row buffer conflicts. To maximize the utilization of prefetched data, we further introduce a prefetch buffer management scheme that makes replacement decisions within the prefetch buffer based on the recency and utilization of rows. Our proposed scheme improves performance by 17.9% on average over the baseline scheme for a wide variety of multiprogramming workloads. It also outperforms an existing memory-side prefetching scheme by 8.7% on average. Our scheme introduces minimal hardware overheads that can easily be accommodated in the logic base of HMC and does not require modification to the main memory controller residing on the processor.
ACKNOWLEDGEMENT
We thank the anonymous reviewers for their constructive
comments. This work is supported in part by the National Science
Foundation under grant CCF-1513899. Any opinions, findings,
and conclusions or recommendations expressed in this material
are those of the authors and do not necessarily reflect the views
of the National Science Foundation.
REFERENCES
[1] W. A. Wulf and S. A. McKee, "Hitting the memory wall: Implications of the obvious," ACM SIGARCH Computer Architecture News, 23(1), 1995.
[2] "Hybrid Memory Cube Specification 2.1," Hybrid Memory Cube Consortium, 2015.
[3] A. Hansson, N. Agarwal, A. Kolli, T. Wenisch, and A. N. Udipi, "Simulating DRAM controllers for future system architecture exploration," ISPASS, 2014.
[4]
[5] "The gem5 Simulator," http://www.gem5.org.
[6] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, 34(4), 1-17, 2006.
[7] M. Oskin, F. T. Chong, and T. Sherwood, "Active Pages: A computation model for intelligent memory," ISCA, 1998, 192-203.
[8] P. Yedlapalli, J. Kotra, E. Kultursay, M. Kandemir, C. R. Das, and A. Sivasubramaniam, "Meeting midway: Improving CMP performance with memory-side prefetching," in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013, 289-298.
[9] D. Ortega et al., "Cost-effective compiler directed memory prefetching and bypassing," PACT, 2002.
D. Joseph and D. Grunwald, "Prefetching using Markov predictors," ISCA, 1997.
[10] I. Hur and C. Lin, "Memory prefetching using adaptive stream detection," MICRO, 2006.
[11] E. Azarkhish, D. Rossi, I. Loi, and L. Benini, "High performance AXI-4.0 based interconnect for extensible smart memory cubes," DATE, 2015.
[12] N. P. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," in Proceedings of the 17th Annual International Symposium on Computer Architecture, 1990, 364-373.
[13] J. Ahn, S. Yoo, and K. Choi, "Low-power hybrid memory cubes with link power management and two-level prefetching," IEEE Trans. on VLSI Systems, 24(2), 2016.
[14] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, "A case for intelligent RAM: IRAM," IEEE Micro, 17(2), 34-44, 1997.
[15] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt, "Feedback directed prefetching: Improving the performance and bandwidth efficiency of hardware prefetchers," in Proc. IEEE 13th Intl. Symp. on High Performance Computer Architecture, 2007, 63-74.
[16] M. Jung, C. Weis, and N. Wehn, "Optimized active and power-down mode refresh control in 3D-DRAMs," VLSI-SoC, 2014.
[17] Y. Wu, "Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching," PLDI, 2002.
[18] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C. C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama, "Impulse: Building a smarter memory controller," in Proceedings of the Fifth International Symposium on High Performance Computer Architecture (HPCA), 1999, 70-79.
[19] C. J. Hughes and S. V. Adve, "Memory-side prefetching for linked data structures for processor-in-memory systems," Journal of Parallel and Distributed Computing, 2005.
[20] Standard Performance Evaluation Corporation (SPEC). https://www.spec.org/
[21] G. Kim, Jo. Kim, J. H. Ahn, and Ja. Kim, "Memory-centric system interconnect design with hybrid memory cubes," PACT, 2013.
[22] A. K. Porterfield, "Software methods for improvement of cache performance on supercomputer applications," Ph.D. thesis, Rice University, 1989.
[23] K. J. Nesbit and J. E. Smith, "Data cache prefetching using a global history buffer," in Proceedings of the 10th International Symposium on High Performance Computer Architecture, 2004, 90-97.
[24] C. J. Wu et al., "PACMan: Prefetch-aware cache management for high performance caching," MICRO, 2011.
[25] R. Panda, Y. Eckert, N. Jayasena, O. Kayiran, M. Boyer, and L. K. John, "Prefetching techniques for near-memory throughput processors," ICS, 2016.
[26] Y. Solihin et al., "Correlation prefetching with a user-level memory thread," IEEE Trans. Parallel Distrib. Syst., 14(6), 2003.
[27] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith, "AC/DC: An adaptive data cache prefetcher," PACT, 2004.
[28] B. Gu et al., "Biscuit: A framework for near-data processing of big data workloads," ISCA, 2016.
[29] D. K. Poulsen and P. C. Yew, "Data prefetching and data forwarding in shared memory multiprocessors," ICPP, 1994.
[30] A. Jog, O. Kayiran, A. K. Mishra, M. T. Kandemir, et al., "OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance," ASPLOS, 2013.
[31] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, "Memory access scheduling," in Proc. 27th Annual International Symposium on Computer Architecture, 2000, 128-138.
[32] D. H. Woo, N. H. Seong, D. L. Lewis, and H. S. Lee, "An optimized 3-D stacked memory architecture by exploiting excessive, high-density TSV bandwidth," HPCA, 2010.