CAMPS: Conflict-Aware Memory-Side Prefetching Scheme for Hybrid Memory Cube

Muhammad M. Rafique
Department of Electrical and Computer Engineering
University of Illinois at Chicago
Chicago, Illinois, USA
mrafiq2@uic.edu

Zhichun Zhu
Department of Electrical and Computer Engineering
University of Illinois at Chicago
Chicago, Illinois, USA
zzhu@uic.edu

ABSTRACT
Prefetching is a well-studied technique in which data is speculatively fetched from main memory ahead of time and stored in caches or in a dedicated prefetch buffer. With the introduction of the Hybrid Memory Cube (HMC), a 3-D memory module with multiple memory layers stacked over a single logic layer and connected by thousands of Through Silicon Vias (TSVs), the availability of huge internal bandwidth makes memory-side prefetching a more efficient approach to improving system performance. In this paper, we introduce a memory-side prefetching scheme for an HMC-based main memory system that utilizes its logic area and exploits the huge internal bandwidth provided by TSVs. Our scheme closely monitors the access patterns to memory banks and makes intelligent prefetch decisions for rows with high utilization or rows causing row-buffer conflicts. We also introduce a prefetch buffer management scheme that makes replacement decisions within the prefetch buffer based on both the utilization and the recency of the prefetched rows. Our simulation results indicate that our approach improves performance by 17.9% on average, compared to a baseline scheme that prefetches a whole row on every memory request. Our scheme also outperforms, by 8.7% on average, an existing memory-side prefetching scheme that dynamically adjusts the prefetch degree based on the usefulness of prefetched data.

CCS CONCEPTS
• Computer systems organization → Processors and memory architectures; • Hardware → Semiconductor memory.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
ICPP 2018, August 13–16, 2018, Eugene, OR, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-6510-9/18/08…$15.00
https://doi.org/10.1145/3225058.3225112

KEYWORDS
Hybrid Memory Cube (HMC), Prefetch Buffer, Memory-Side Prefetching, Memory Bandwidth.

ACM Reference format:
Muhammad Rafique, Zhichun Zhu. 2018. CAMPS: Conflict-Aware Memory-Side Prefetching Scheme for Hybrid Memory Cube. In ICPP 2018: 47th International Conference on Parallel Processing, August 13–16, 2018, Eugene, OR, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3225058.3225112

1. INTRODUCTION
The increasing speed and bandwidth gap between processor and memory, commonly referred to as the "memory wall" problem [1], is a major performance bottleneck in today's computer systems.
This problem gets even worse as we enter the era of many-core systems and big-data applications. The traditional DDRx-based main memory system fails to cope with the increasing bandwidth and speed requirements of the processor. The off-chip data access latency, limited bandwidth, and random requests in multicore environments are the major factors driving researchers to find alternative approaches to main memory organization. One of the ideas is to reduce the off-chip data access latency by bringing processing closer to data, known as Processing-In-Memory (PIM) [7], [14], [18]. Although this approach is promising for hiding data access latency, the challenge of integrating logic and memory has been a major obstacle to turning the idea into a successful commercial product. Recently, this idea was revived when Micron® introduced the Hybrid Memory Cube (HMC) [2], a 3-D structure in which multiple memory layers are stacked over a single logic layer and connected by Through Silicon Vias (TSVs). This 3-D stacking of memory over a logic layer mitigates the bandwidth gap between processor and main memory to a great extent and reduces the total energy consumption per useful unit of work done [2]. Stacking DRAM on top of a logic layer not only opens up whole new possibilities in the way we manage main memory requests, but also creates room to improve existing techniques for tackling the memory wall problem.
One of the techniques used to hide data access latency is prefetching, where data is brought in speculatively ahead of its actual need and stored near the processing element, where it can be accessed with minimum latency. While prefetching tends to improve system performance, it comes with the overheads of increased memory bandwidth requirements and energy consumption. Prefetch transactions also compete with demand accesses to main memory for precious resources such as memory bandwidth and cache space. Both core-side prefetching [18, 23, 26] and memory-side prefetching [8, 19, 25] have been studied extensively. There are several design considerations for prefetching, such as when and which data should be prefetched and where the prefetched data should be stored. With the 3-D stacking of DRAM over a logic layer, the optimal design choices for memory-side prefetching may differ from those of existing schemes targeting conventional DDRx-based main memory systems, since HMC provides a substantial improvement in bandwidth and the availability of logic area near memory.
In this paper, we propose CAMPS, a Conflict-Aware Memory-Side Prefetching Scheme that stores prefetched data in a dedicated buffer located in the logic base of the HMC. Data is moved between the memory banks and the prefetch buffer at the granularity of the row-buffer size, taking full advantage of the thousands of TSVs that are not available in DDRx-based main memory systems. The main idea is to closely monitor the access patterns to different rows within a bank and prefetch a whole row into the prefetch buffer based on two criteria. First, it prefetches those rows that tend to cause more row-buffer conflicts. Second, it prefetches those rows that have higher utilization, i.e., with multiple blocks within the row being accessed. We also propose a replacement policy designed specifically for the prefetch buffer. When a replacement is needed, the decision is made dynamically based on both the utilization and the recency of the prefetched rows.
Our scheme implements prefetching on the memory side, and prefetched data is not aggressively pushed to the caches. This avoids cache pollution and saves the energy that would otherwise be consumed by moving unutilized data between the caches and main memory. The following are the major contributions of this work:
• We explore memory-side prefetching in the Hybrid Memory Cube by extensively utilizing its huge internal bandwidth and its logic base.
• We introduce a prefetch decision mechanism that prefetches rows based on their utilization and their likelihood of causing row-buffer conflicts.
• We propose a prefetch buffer management policy that makes replacement decisions based on the utilization of the prefetched data along with the recency of rows in the prefetch buffer.
Simulation results indicate that our scheme can effectively improve system performance. Compared to a baseline scheme that prefetches a whole row on every memory request, our approach improves performance by 17.9% on average for eight-core multiprogramming SPEC CPU2006 workloads. Compared to an existing memory-side prefetching scheme that dynamically adjusts the prefetch degree based on the usefulness of prefetched data and uses a traditional LRU replacement policy for the prefetch buffer, our scheme improves performance by 8.7% on average.
The rest of the paper is organized as follows: Section 2 briefly describes the HMC structure, the basics of prefetching, and the motivation for our work. Section 3 describes our proposed prefetching scheme in detail. Section 4 describes our experimental methodology. Analyses of the effectiveness of our approach are presented in Section 5. Section 6 discusses other related work, and Section 7 concludes the paper.

2. BACKGROUND AND MOTIVATION
In this section, we briefly review the HMC architecture, prefetching, and the design considerations for an effective prefetching mechanism. We also analyze how memory-side prefetching can be adapted to new memory technologies and show why the HMC architecture is well suited for it.

2.1 HMC Architecture
A Hybrid Memory Cube (HMC) consists of multiple memory layers stacked over a single logic layer, as shown in Figure 1.

Figure 1: HMC Structure

Each memory layer, along with the logic layer, is further partitioned into slices. A single slice of the logic layer and the memory partitions stacked above it are connected by multiple Through Silicon Vias (TSVs), forming a column-like structure called a vault. Each vault is functionally independent and has its own vault controller residing in its logic base. The vault controller manages low-level DRAM operations such as address mapping, refreshing, and memory access scheduling, taking this burden off the main memory controller residing on the processor chip. A single memory layer partition is further divided into multiple banks that can be accessed concurrently. A single vault can have as many as 16 banks (two banks per vault layer with 8 DRAM layers), which makes up to 512 banks available in a single HMC package with a 32-vault configuration [2]. This organization of banks has huge potential for exploiting bank-level parallelism, which is critical for the random requests coming from today's multi-core and many-core processors. The processor is connected to the HMC through very high-speed serial I/O links operating at up to 30 Gb/s.
All the serial links are connected to the vault controllers through a crossbar switch that routes each request packet coming from the processor to the appropriate vault controller. These serial links operate at very high frequency and, together with the thousands of TSVs connecting the memory layers to the vault controllers, collectively provide an enormous amount of bandwidth that reaches up to 320 GB/s. Figure 2 shows the architecture of an HMC-based main memory system equipped with a memory-side prefetcher.

Figure 2. HMC Architecture equipped with Memory-Side Prefetcher

2.2 Prefetching
Prefetching is a well-known technique used to speed up program execution by fetching data speculatively ahead of its actual need. The idea is to bring in, along with the requested blocks, additional cache blocks from main memory that might be useful in the future. Although prefetching tends to improve performance, it may also degrade performance if not done carefully; there are several design considerations for prefetching, which we discuss shortly. Prefetching can be implemented in either software or hardware. In software-based prefetching, prefetch instructions can be inserted by the compiler [22] or by the programmer. This type of prefetching gives the programmer control over program execution but also increases the code footprint. In hardware-based prefetching, memory access patterns are monitored by specialized hardware, and data is prefetched to either the caches or a dedicated prefetch buffer, based on a prediction mechanism that takes advantage of previous access patterns and bandwidth availability. Our scheme deploys an effective hardware-based memory-side prefetcher, utilizing the huge internal bandwidth and the logic area available near the data storage.

2.3 Prefetching Design Considerations
An effective prefetching scheme improves system performance by reducing the average memory access time (AMAT), as prefetched data is available in a storage space that can be accessed with shorter latency. Prefetching, however, comes with its own overheads. Certain issues need to be addressed to make sure the cost of implementing prefetching does not exceed the benefits gained from it. For example, what to prefetch is the most important design consideration. Special attention should be paid to selecting candidate data for prefetching, as unnecessarily prefetched data causes energy overhead and additional delays for the data that is actually requested by the processor from main memory. When to prefetch is another factor to be taken into consideration. Prefetching should be done just in time to gain the maximum benefit. Prefetching too early may cause the prefetched data to be evicted even before its potential use. Prefetching too late essentially defeats the purpose of prefetching, as it would not hide the data access latency. Finally, the question of where to store prefetched data must be answered. Prefetched data can be brought into the caches directly along with the demanded data, or it can be placed in a separate prefetch buffer. Storing it in the caches reduces the effective cache capacity for the actually requested data, while a separate prefetch buffer requires additional space. In summary, an efficient prefetching mechanism correctly predicts and fetches data in a timely fashion and stores it where it can be accessed with minimum overhead.
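To make the AMAT argument above concrete, the standard textbook decomposition can be written as follows; the split of the miss penalty between a prefetch-buffer hit and a full DRAM access is our own illustration and is not a formula given in the paper:

\[
\mathrm{AMAT} \;=\; t_{\mathrm{hit}} \;+\; m \cdot \big( p_{\mathrm{pb}} \, t_{\mathrm{pb}} + (1 - p_{\mathrm{pb}}) \, t_{\mathrm{dram}} \big)
\]

where \(t_{\mathrm{hit}}\) is the cache hit time, \(m\) is the last-level cache miss rate, \(p_{\mathrm{pb}}\) is the fraction of misses served by the prefetch buffer, and \(t_{\mathrm{pb}} < t_{\mathrm{dram}}\) are the prefetch-buffer and DRAM access latencies. Raising \(p_{\mathrm{pb}}\) with accurate prefetching lowers the effective miss penalty and therefore the AMAT.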
2.4 Motivation
Keeping the key design considerations in mind and analyzing the advantages of the HMC architecture explained earlier, we now explain how it can be best exploited to design an efficient memory-side prefetching mechanism. As mentioned earlier, the placement of prefetched data is a key design choice when implementing efficient hardware prefetching. Several previous proposals suggest that the prefetched data should be brought into the caches [15, 24], so that it is readily available to the processor core with minimum latency. In that case, the effective cache capacity for demand requests is reduced and prefetched data might get evicted without use. The alternative approach is to prefetch data into a dedicated prefetch buffer residing in the memory controller and push it to the caches when requested [8, 10, 19, 26].
The idea of memory-side prefetching gained attention because the memory side has more information about the actual state of memory-related parameters, such as channel availability and the state of the row buffers. A prefetch can be initiated only when the channel is available to carry out the prefetch transaction. Core-side prefetching, on the other hand, relies on prediction algorithms and has little knowledge of the memory bank state. Aggressive prefetching can waste resources such as memory bandwidth and energy, and it might delay demand requests due to the additional network traffic congestion [24].
With the introduction of HMC and the availability of more logic real estate to control memory operations, HMC is a pragmatic platform for implementing memory-side prefetching for the following reasons:
• A prefetch buffer can easily be placed within each vault controller. As vaults are functionally independent of each other, a vault controller can accommodate one buffer that caters to the requests destined for it.
• Due to the huge internal bandwidth made possible by thousands of TSVs [32], a whole row of data can be aggressively prefetched into the buffer without utilizing the external links connected to the processor, thus reducing congestion.
• As prefetching implemented within the vaults does not utilize external resources such as the I/O links, under-utilization of prefetched data does not hurt as much as it would in core-side prefetching or in memory-side prefetching in traditional DRAM-based systems where the prefetch buffer resides in the memory controller on the processor chip.
• The prefetched data is not proactively pushed towards the upper-level caches, thus avoiding cache pollution. It is pushed only if requested, thereby turning "pre"-fetching into "demand" fetching.
• As the I/O serial links operate at much higher frequency than regular memory buses [2], accessing the off-chip prefetched data incurs shorter latency.

3. PROPOSED PREFETCHING SCHEME
Considering the potential advantages of implementing memory-side prefetching on the HMC platform described earlier, we propose a novel memory-side prefetching scheme that prefetches data by extensively utilizing the huge internal bandwidth available in HMC vaults while not aggressively pushing it to the caches, except when requested. The memory-side prefetching scheme is implemented entirely in the vault controllers residing in the logic base of the HMC. Our scheme closely monitors the state of the memory banks and prefetches data cautiously based on the utilization of the rows currently opened in the row buffers. It also profiles recently opened rows and prefetches the rows that tend to cause row-buffer conflicts.
Our scheme further considers the utilization and recency of each row residing in the prefetch buffer and makes intelligent replacement decisions.

3.1 Conflict-Aware Prefetching
Row-buffer conflicts within memory banks are a major cause of delays in accessing main memory. In conflict-aware prefetching, our scheme takes this factor into account by profiling and prefetching the rows that tend to cause the most row-buffer conflicts. In this way, requests to these rows will be served from the prefetch buffer, thus reducing row-buffer conflicts. To implement this scheme, we use two tables for profiling the behavior of rows. The first one is the Row Utilization Table (RUT), with one entry for each of the 16 banks in a vault, which keeps track of how many distinct cache lines have been accessed from the row that is currently in the row buffer. The second is the Conflict Table (CT), with 32 entries per vault, which keeps track of rows that have recently been replaced from the row buffers. The conflict table has a fully associative structure, and its entries are shared by all the banks in a vault.
Once a request from the processor reaches the memory, the vault controller first checks the prefetch buffer to see whether it holds the requested data. If there is a miss in the prefetch buffer, the controller checks whether the request is a row-buffer hit. Upon a row-buffer hit, it serves the request from the row buffer and records the utilization of the row in the RUT if that information is not there yet. Each entry in the RUT is a utilization counter, which is incremented every time a request is served from the corresponding row. If the number of accesses to a row exceeds a threshold value (four in our experiments), our scheme fetches the whole row into the prefetch buffer and precharges the bank to make it ready for the next request. An access to a different row of the same bank replaces the bank's previous row utilization entry in the RUT, and the replaced entry is moved to the CT. The row utilization information kept in the CT is used later to determine whether a row causes row-buffer conflicts.
When there is a row-buffer miss, our scheme precharges the bank, activates the required row, and serves the request as normal. In addition, if the newly opened row already has an entry in the CT, the row has been accessed recently, which indicates that this row caused a row-buffer conflict and is a good candidate for prefetching. After fetching this row into the prefetch buffer, its entry is removed from the CT and the bank is precharged. If the CT is full, our scheme uses an LRU policy to evict the information of the least recently accessed row. If the newly opened row has no entry in the CT, our scheme keeps that row open and inserts its information into the RUT as explained above. Figure 3 summarizes our scheme. As we will show later in the results section, this approach considerably reduces row-buffer conflicts, because both the frequently accessed rows with high utilization and the rows causing more row-buffer conflicts are prefetched into the prefetch buffer. As our scheme is based on aggressive prefetching [15], we next introduce another optimization to manage the prefetch buffer effectively and minimize the overheads of aggressive prefetching.

Figure 3. CAMPS Scheme
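To make the control flow of Section 3.1 easier to follow, the sketch below restates it as simplified Python pseudocode. The sizes (one RUT entry per bank, a 32-entry shared CT, and the threshold of four accesses) follow the paper, but the class, method names, and the simplified per-bank model are our own illustration rather than the authors' implementation; the prefetch-buffer lookup that precedes this logic is assumed to happen in the caller.

from collections import OrderedDict

ROW_UTIL_THRESHOLD = 4   # open-row accesses before the row is prefetched (Sec. 3.1)
CT_ENTRIES = 32          # Conflict Table entries per vault, shared by all banks
BANKS_PER_VAULT = 16     # one RUT entry per bank

class VaultPrefetchLogic:
    """Per-vault CAMPS bookkeeping (illustrative sketch, not the real hardware)."""

    def __init__(self):
        self.rut = {}                                      # bank -> utilization counter of its open row
        self.ct = OrderedDict()                            # (bank, row) of recently closed rows, LRU order
        self.open_row = {b: None for b in range(BANKS_PER_VAULT)}

    def access(self, bank, row):
        """Return 'prefetch' if the whole row should be copied to the prefetch buffer."""
        if self.open_row[bank] == row:                     # row-buffer hit
            self.rut[bank] = self.rut.get(bank, 0) + 1
            if self.rut[bank] >= ROW_UTIL_THRESHOLD:       # highly utilized row
                del self.rut[bank]
                self.open_row[bank] = None                 # bank is precharged after the copy
                return "prefetch"
            return "serve"

        # Row-buffer miss: the previously tracked row moves from the RUT to the CT.
        if self.open_row[bank] is not None and bank in self.rut:
            if len(self.ct) >= CT_ENTRIES:
                self.ct.popitem(last=False)                # evict the LRU CT entry
            self.ct[(bank, self.open_row[bank])] = True
            del self.rut[bank]

        if (bank, row) in self.ct:                         # the row caused a conflict recently
            del self.ct[(bank, row)]
            self.open_row[bank] = None                     # serve, copy the row, then precharge
            return "prefetch"

        self.open_row[bank] = row                          # keep the new row open and start tracking it
        self.rut[bank] = 1
        return "serve"

In the real design this bookkeeping would sit next to the vault controller's scheduling logic, and a "prefetch" outcome would trigger a row-granularity transfer over the TSVs into the per-vault buffer.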
3.2 Utilization and Recency Based Prefetch Buffer Management
Existing memory-side prefetching schemes choose the least recently accessed entry in the prefetch buffer as the replacement candidate [8, 13]. To better utilize the limited number of entries in the prefetch buffer, our scheme considers both the utilization and the recency of each row that has been brought into the prefetch buffer. The utilization of a row is defined as the number of distinct cache lines referenced within that row. The recency of a row is defined by the last time the row was referenced in the prefetch buffer. Our scheme uses a utilization counter for each prefetched row, which is incremented every time a distinct cache line is accessed within the row. To keep track of recency, we use a separate recency counter for each prefetched row. When a row is accessed, its recency counter is set to 15, the maximum value for the 16-entry prefetch buffer, while the recency counters of the rows whose values were larger than the accessed row's previous value are decremented. The least recently accessed row has a recency counter value of 0.
To make a replacement decision, our scheme first checks whether there is any row with all of its distinct cache lines accessed. If so, that row does not need to stay in the prefetch buffer, since all its data has already been transferred to the processor. If no such row exists, our scheme adds up the values of the utilization counter and the recency counter of each row and replaces the row with the minimum sum. If two or more rows have the same minimum sum, the row with the lowest utilization count is evicted. Figure 4 summarizes this scheme. This approach gives highly utilized and frequently accessed rows a longer life in the prefetch buffer. As the results will show, it is a more efficient replacement scheme than LRU.

Figure 4. Replacement policy for prefetch buffer
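The replacement decision of Section 3.2 can likewise be sketched in a few lines of Python. The 16-entry buffer and 16 cache lines per row follow the configuration in Table I (16KB per vault, 1KB rows, 64B cache lines), while the function and field names are our own illustration, not the authors' implementation.

from dataclasses import dataclass, field
from typing import List, Set

LINES_PER_ROW = 16      # 1KB prefetched row / 64B cache lines (Table I)
BUFFER_ENTRIES = 16     # 16KB prefetch buffer per vault, 1KB per entry
MAX_RECENCY = BUFFER_ENTRIES - 1

@dataclass
class PrefetchedRow:
    row_id: int
    lines_used: Set[int] = field(default_factory=set)   # distinct cache lines served
    recency: int = MAX_RECENCY                           # 15 = MRU, 0 = LRU

    @property
    def utilization(self) -> int:
        return len(self.lines_used)

def touch(buffer: List[PrefetchedRow], row: PrefetchedRow, line: int) -> None:
    """Record an access to `line` of `row` and update the recency counters."""
    row.lines_used.add(line)
    for other in buffer:
        if other is not row and other.recency > row.recency:
            other.recency -= 1          # slide rows above the accessed one down
    row.recency = MAX_RECENCY           # the accessed row becomes the MRU entry

def pick_victim(buffer: List[PrefetchedRow]) -> PrefetchedRow:
    """Choose the row to evict when the prefetch buffer is full."""
    # 1) A fully consumed row has nothing left to offer: evict it first.
    for row in buffer:
        if row.utilization == LINES_PER_ROW:
            return row
    # 2) Otherwise evict the row with the smallest utilization + recency sum,
    #    breaking ties in favor of the lower utilization count.
    return min(buffer, key=lambda r: (r.utilization + r.recency, r.utilization))

The tuple key in pick_victim encodes the tie-breaking rule directly: rows are compared first by the sum and then by utilization, so a barely used row loses to an equally ranked but better utilized one.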
3.3 Overhead
Our proposed prefetching scheme introduces some area overhead. The row utilization table (RUT) consists of 16 entries per vault controller, each entry being 20 bits, making the hardware overhead of the RUT 40 bytes per vault; with 32 vault controllers, this gives a total overhead of 1.25KB. Similarly, we also have one conflict table (CT) per vault controller, with 32 entries and each entry also being 20 bits. The hardware overhead of the CT comes to 2.5KB in total for the 32 vault controllers. Thus, our scheme introduces a total overhead of 3.75KB over the baseline prefetcher architecture, which is trivial compared with the prefetch buffer itself. Note that our scheme has the same prefetch buffer size as the other memory-side prefetching schemes used in the experiments.

4. EXPERIMENTAL METHODOLOGY
We evaluate our memory-side prefetching scheme using the cycle-accurate x86 full-system simulator gem5 [4]. The Hybrid Memory Cube (HMC) is modeled based on [2, 11, 13, 21]. This simplified yet accurate HMC model is composed of the vault controllers, the serial links, the main internal crossbar, and an external HMC controller. The detailed DRAM operations are based on the models presented in [3] and [16]. The details of the simulated system are shown in Table I.

Table I: Experimental Configuration
  Processor: 8 cores @ 3GHz, issue width = 4, x86 ISA, out-of-order
  Caches: L1(I/D): 32KB pvt., 2-way, hit lat. = 2 cycles; L2: 256KB pvt., 4-way, hit lat. = 6 cycles; L3: 16MB shrd., 16-way, hit lat. = 20 cycles, 64B line
  HMC: 8 DRAM layers, 32 vaults, 2 banks per vault layer, 1KB row-buffer size
  Vault Controller: DDR3-1600, Queue Size (R/W) = 32
  DRAM Parameters: trcd = 11 cycles, trp = 11 cycles, tcl = 11 cycles
  Serial Links: 4 serial links with 16 I/P and 16 O/P lanes for full duplex configuration, 12.5 Gbps link
  PF Buffer: 16KB per vault, fully associative, 1KB line, hit latency = 22 cycles
  Address Mapping: RoRaBaVaCo (row-rank-bank-vault-column)
  Memory Scheduling: FR-FCFS [31] (first-ready, first-come-first-serve)
  Page Policy: Open Page

4.1 Workloads
We use SPEC CPU2006 [6, 20] benchmarks to form eight-core multiprogramming workloads. A total of 12 different sets of benchmarks have been chosen and categorized based on misses per kilo instructions (MPKI) [13]. Applications with MPKI ≥ 20 are categorized as high memory intensive (HM), and applications with 20 > MPKI ≥ 1 are categorized as low memory intensive (LM). We then choose four applications from the HM set and four from the LM set to make a third working set of mixed (MX) benchmarks. Table II gives the details of these eight-core multiprogramming workloads.

Table II: SPEC CPU2006 Benchmark Sets
  HM1: bwaves, gems, gcc, lbm, bwaves, gcc, lbm, gems
  HM2: milc, gems, sphinx, omnetpp, sphinx, milc, omnetpp, gems
  HM3: gcc, mcf, lbm, milc, mcf, gcc, milc, lbm
  HM4: sphinx, gcc, lbm, bwaves, sphinx, bwaves, lbm, gcc
  LM1: cactus, bzip2, astar, wrf, wrf, bzip2, cactus, astar
  LM2: tonto, zeusmp, h264ref, astar, zeusmp, h264ref, astar, tonto
  LM3: bzip2, zeusmp, cactus, tonto, cactus, zeusmp, bzip2, tonto
  LM4: astar, tonto, bzip2, h264ref, tonto, astar, bzip2, h264ref
  MX1: bwaves, gcc, cactus, wrf, cactus, gcc, wrf, bwaves
  MX2: gems, sphinx, tonto, h264ref, sphinx, gems, h264ref, tonto
  MX3: milc, lbm, wrf, bzip2, lbm, bzip2, milc, wrf
  MX4: gcc, bwaves, bzip2, astar, bwaves, gcc, bzip2, astar

The experiments first fast-forward the simulations for two billion instructions from the beginning and warm up the caches for 100 million instructions. Detailed simulations are then run for 800 million instructions.

5. RESULTS AND ANALYSIS
In this section, we present the results of our proposed scheme and compare it to multiple memory-side prefetching schemes. The first one is the baseline scheme, which prefetches a whole row at the first access to the row (called BASE afterward). The second scheme prefetches a whole row if the row has two or more hits among the requests in the read queue; this scheme is referred to as BASE-HIT. We also compare our scheme with an existing memory-side prefetching scheme [8], which dynamically adjusts the prefetch degree based on the usefulness of prefetched data and uses a traditional LRU policy for prefetch buffer management (called MMD afterward). Our conflict-aware prefetching scheme will be referred to as CAMPS, and our utilization and recency based prefetch buffer management scheme, applied on top of CAMPS, is referred to as CAMPS-MOD.

5.1 Performance Improvement
Figure 5 shows the overall performance gains obtained by our schemes compared with the different memory-side prefetching schemes mentioned above.

Figure 5. Normalized performance gains of CAMPS with different schemes (normalized speedup; higher is better)

The performance gains are calculated as the geometric mean of the IPC values of the different workloads running on the eight processor cores. The performance of the baseline is normalized to 1.
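The paper does not spell out the exact normalization formula; one plausible reading of "geometric mean of the IPC values ... normalized to 1" is:

\[
\text{Speedup}_{s} \;=\; \frac{\left(\prod_{i=1}^{8} \mathrm{IPC}_{i}^{\,s}\right)^{1/8}}{\left(\prod_{i=1}^{8} \mathrm{IPC}_{i}^{\,\mathrm{BASE}}\right)^{1/8}}
\]

where \(\mathrm{IPC}_i^{\,s}\) is the IPC of the application on core \(i\) under scheme \(s\). With this definition the BASE scheme evaluates to 1 by construction, matching the normalization described above.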
The average gain for the high memory intensive workloads (HM) is 24.9% over BASE, 16.3% over BASE-HIT, and 21.8% over MMD. The performance gain for the low memory intensive workloads (LM) is 9.4% over BASE, 5.1% over BASE-HIT, and 4.9% over MMD. Finally, for the mixed workloads (MX), our scheme improves performance by 19.6%, 16.1%, and 8.7% over BASE, BASE-HIT, and MMD, respectively. On average, our scheme CAMPS-MOD outperforms BASE by 17.9%, BASE-HIT by 16.8%, and MMD by 8.7% across a wide range of multiprogramming workloads.
We can see from the results that the high memory intensive workloads get the largest performance gains compared to their less memory intensive counterparts, while the gains for the mixed workloads fall in between the two. This is because high memory intensive workloads need to access off-chip data more often. As our scheme selectively prefetches data into the prefetch buffer to increase prefetch buffer utilization and reduce row-buffer conflicts, it is more likely that the requested data can be found in the prefetch buffer, which translates into lower main-memory access latency.

5.2 Row-Buffer Conflict Reduction
Figure 6 shows the row-buffer conflict reduction for the various prefetching schemes. The BASE scheme is not included here because it prefetches the whole row every time a row is opened in the row buffer and precharges the bank after the row is copied to the prefetch buffer, so there are no row-buffer conflicts. Of course, compared with the other schemes, the BASE scheme is more likely to fetch unnecessary rows into the prefetch buffer. As our scheme takes potential row-buffer conflicts into account while making prefetch decisions, it considerably reduces row-buffer conflicts. As shown in the figure, our scheme reduces row-buffer conflicts on average by 16.3% across all banks within the vaults compared to BASE-HIT, and by 13.6% compared to MMD. The other schemes do not take the conflicts within banks into account while making prefetch decisions. This reduction in row-buffer conflicts leads to a reduced average memory access time (AMAT), which results in an overall speedup of program execution.

Figure 6. Row Buffer Conflict Reduction (percentage of row-buffer conflicts; lower is better)

5.3 Prefetching Accuracy
Figure 7 shows the prefetching accuracy of our proposed scheme compared to the others. Prefetching accuracy is defined as the fraction of all performed prefetches whose data is actually referenced by the processor. A higher accuracy is desired for a prefetching scheme to be efficient. Our scheme CAMPS-MOD has an accuracy of 70.5% on average and outperforms BASE, BASE-HIT, and MMD by 33.3%, 28.4%, and 4.1%, respectively. Note that our scheme CAMPS without the optimized prefetch buffer management has an accuracy slightly lower than MMD (by 1.5%), which led us to introduce the recency and utilization based prefetch buffer management that evicts rows that have low utilization and have not been accessed recently.
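In symbols, the accuracy metric used in Section 5.3 can be written as:

\[
\text{Prefetch accuracy} \;=\; \frac{N_{\text{prefetches whose data is later referenced by the processor}}}{N_{\text{total prefetches issued}}}
\]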
Figure 7. Prefetching Accuracy (higher is better)

5.4 Reduction in Memory Access Latency
Figure 8 shows the reduction in average memory access time (AMAT) of our scheme and MMD compared to BASE. Our proposed scheme sorts out the rows that tend to cause more row-buffer conflicts and prefetches them into the buffer based on repeated accesses within a given window of time. This makes accesses to main memory less contentious. As a result, there is a reduction in average memory access time. The results show that our scheme reduces AMAT by 26% compared to BASE and by 16.3% compared to MMD on average.

Figure 8. Reduction in Average Memory Access Time (higher is better)

5.5 HMC Energy Consumption
Figure 9 shows the average energy consumption of the HMC across all vaults for the different prefetching schemes, normalized to the BASE scheme. As the BASE scheme fetches a whole row to the prefetch buffer and has the lowest overall prefetching accuracy, it leads to more frequent replacements of rows from the prefetch buffer back to the memory banks, making it the scheme that consumes the most energy. As can be seen in the figure, the MMD and CAMPS-MOD schemes consume less energy than the BASE scheme, by 6.0% and 8.5%, respectively. The energy savings are mainly due to fewer activation and precharge operations.

Figure 9. Average HMC energy consumption normalized to BASE (lower is better)

6. RELATED WORK
6.1 Prefetching
Prefetching has been a widely studied technique to improve system performance. Several prior research studies have proposed core-side prefetching [18, 23, 26], where prefetch requests are initiated by an engine in the cache hierarchy, and memory-side prefetching [8, 10, 25], where the prefetching mechanism is implemented in the memory controller. Hur and Lin [10] proposed an adaptive stream detection prefetching policy that dynamically adjusts the aggressiveness of the prefetcher based on the spatial locality in the running application. The prefetcher detects a stream of n successively accessed cache lines, issues a prefetch for line n+1, and continues prefetching until prefetched cache lines are no longer utilized. Our scheme, on the other hand, works at the granularity of the row-buffer size, prefetches a whole row into the prefetch buffer, and selectively pushes data to the caches based on requests. This way, our scheme avoids pushing useless data blocks to the caches, hence avoiding cache pollution. Panda et al. [25] proposed a combination of three different lightweight memory-side prefetchers aiming to improve the performance of GPU-PIM systems. Their work mostly targets GPU-based real-world applications that exhibit more locality, while our scheme is suitable for any general-purpose workload. Jog et al. [30] proposed a memory-side prefetcher that prefetches the cache lines that have not been touched within the DRAM row currently opened in the row buffer, aiming to improve row-buffer locality.
6.2 Memory-Side Prefetching in HMC
With HMC being a relatively new technology, very few studies have concentrated on implementing memory-side prefetching with an HMC-based main memory system. Other related memory-side prefetching schemes assume a prefetch buffer located inside the memory controller on the processor chip. Ahn et al. [13] studied the energy efficiency of an HMC-based main memory system in the presence of a conservative core-side prefetcher and an aggressive memory-side prefetcher with varying prefetch degree and distance, using an LRU policy for prefetch buffer management. Our work, on the other hand, implements an efficient memory-side prefetcher that monitors the access patterns to memory banks and prefetches the rows that cause the most row-buffer conflicts along with the rows that have more hits within the row buffer. The prefetch buffer is managed by taking into account the utilization and recency of the data within the buffer.

7. CONCLUSION
In this paper, we propose CAMPS, a conflict-aware memory-side prefetching scheme for HMC-based main memory systems, which makes intelligent prefetching decisions by keeping track of the utilization of the rows currently opened in the row buffers and the likelihood of rows causing row-buffer conflicts. In order to maximize the utilization of prefetched data, we further introduce a prefetch buffer management scheme that makes replacement decisions within the prefetch buffer based on the recency and utilization of rows. Our proposed scheme improves performance by 17.9% on average over the baseline scheme for a wide variety of multiprogramming workloads. It also outperforms an existing memory-side prefetching scheme by 8.7% on average. Our scheme introduces minimal hardware overheads that can easily be accommodated in the logic base of the HMC and does not require any modification to the main memory controller residing on the processor.

ACKNOWLEDGEMENT
We thank the anonymous reviewers for their constructive comments. This work is supported in part by the National Science Foundation under grant CCF-1513899. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

REFERENCES
[1] W. A. Wulf and S. A. McKee, "Hitting the memory wall: Implications of the obvious," ACM SIGARCH Computer Architecture News, 23(1), 1995.
[2] "Hybrid Memory Cube Specification 2.1," Hybrid Memory Cube Consortium, 2015.
[3] A. Hansson, N. Agarwal, A. Kolli, T. Wenisch, and A. N. Udipi, "Simulating DRAM controllers for future system architecture exploration," ISPASS, 2014.
[4] "The gem5 Simulator," http://www.gem5.org.
[5] D. Joseph and D. Grunwald, "Prefetching using Markov predictors," ISCA, 1997.
[6] J. L. Henning, "SPEC CPU2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, 34(4), 1-17, 2006.
[7] M. Oskin, F. T. Chong, and T. Sherwood, "Active Pages: A computation model for intelligent memory," ISCA, 1998, 192-203.
[8] P. Yedlapalli, J. Kotra, E. Kultursay, M. Kandemir, C. R. Das, and A. Sivasubramaniam, "Meeting midway: Improving CMP performance with memory-side prefetching," in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013, 289-298.
[9] D. Ortega et al., "Cost-effective compiler directed memory prefetching and bypassing," PACT, 2002.
[10] I. Hur and C. Lin, "Memory prefetching using adaptive stream detection," MICRO, 2006.
[11] E. Azarkhish, D. Rossi, and I. L. Dei, "High performance AXI-4.0 based interconnect for extensible smart memory cube," DATE, 2015.
[12] N. P. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," in Proceedings of the 17th Annual International Symposium on Computer Architecture, 1990, 364-373.
[13] J. Ahn, S. Yoo, and K. Choi, "Low-power hybrid memory cubes with link power management and two-level prefetching," IEEE Trans. on VLSI Systems, 24(2), 2016.
[14] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, "A case for intelligent RAM: IRAM," IEEE Micro, 17(2), 34-44, 1997.
[15] S. Srinath, O. Mutlu, H. Kim, and Y. N. Patt, "Feedback directed prefetching: Improving the performance and bandwidth efficiency of hardware prefetchers," in Proc. IEEE 13th Intl. Symp. on High Performance Computer Architecture, 2007, 63-74.
[16] M. Jung, C. Weis, and N. Wehn, "Optimized active and power-down mode refresh control in 3D-DRAMs," VLSI-SoC, 2014.
[17] Y. Wu, "Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching," PLDI, 2002.
[18] J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C. C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama, "Impulse: Building a smarter memory controller," in Proceedings of the Fifth International Symposium on High Performance Computer Architecture (HPCA), 1999, 70-79.
[19] C. J. Hughes and S. V. Adve, "Memory-side prefetching for linked data structures for processor-in-memory systems," Journal of Parallel and Distributed Computing, 2005.
[20] Standard Performance Evaluation Corporation (SPEC), https://www.spec.org/.
[21] G. Kim, Jo. Kim, J. H. Ahn, and Ja. Kim, "Memory-centric system interconnect design with hybrid memory cubes," PACT, 2013.
[22] A. Porterfield, "Software methods for improvement of cache performance on supercomputer applications," Ph.D. thesis, Rice University, 1989.
[23] K. J. Nesbit and J. E. Smith, "Data cache prefetching using a global history buffer," in Proceedings of the 10th International Symposium on High Performance Computer Architecture, 2004, 90-97.
[24] C. J. Wu et al., "PACMan: Prefetch-aware cache management for high performance caching," MICRO, 2011.
[25] R. Panda, Y. Eckert, N. Jayasena, O. Kayiran, M. Boyer, and L. K. John, "Prefetching techniques for near-memory throughput processors," ICS, 2016.
[26] Y. Solihin et al., "Correlation prefetching with a user-level memory thread," IEEE Trans. Parallel Distrib. Syst., 14(6), 2003.
[27] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith, "AC/DC: An adaptive data cache prefetcher," in PACT, 2004.
[28] B. Gu et al., "Biscuit: A framework for near-data processing of big data workloads," in ISCA, 2016.
[29] D. K. Poulsen and P. C. Yew, "Data prefetching and data forwarding in shared memory multiprocessors," in ICPP, 1994.
[30] A. Jog, O. Kayiran, K. Mishra, M. T. Kandemir, et al., "OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance," in ASPLOS, 2013.
[31] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, "Memory access scheduling," in Proc. 27th Annual International Symposium on Computer Architecture, 2000, 128-138.
[32] D. H. Woo, N. H. Seong, D. L. Lewis, and H. S. Lee, "An optimized 3-D stacked memory architecture by exploiting excessive, high-density TSV bandwidth," HPCA, 2010.