Application Specific Logic in Memory

Qiuling Zhu, Berkin Akin, Huseyin Ekin Sumbul

1. Motivation and Project Definition

Memory hierarchies, data prefetching, and out-of-order execution are all popular techniques for tolerating long memory access latencies in existing processors, and they work well for applications that fit nicely into the caches or exhibit enough locality. These platforms, however, become bottlenecks for data- and memory-intensive applications, and for applications with irregular data access patterns. For example, digital signal processing (DSP) and scientific computations often fail to meet CPU-speed-based expectations by a wide margin, and prefetchers are generally unable to predict access patterns that follow pointers through linked data structures. Moreover, the CPU-centric design philosophy has led to very complex superscalar processors with deep pipelines and a large amount of support logic.

Recent VLSI technology trends offer a promising way to bridge the processor-memory gap: integrating processor logic and memory on a single chip. There have been significant recent advances in mixing memory and logic more closely than in the classical CPU-memory dichotomy, such as the PIM architecture [1], the IRAM architecture [2], In-Cache Computations (ICC) [3], and memory-side prefetching [4], all of which seek to improve performance and lower power consumption by exploiting simple logic embedded in memory.

We propose to create "smart" memories: logic-in-memory designs that place significant amounts of logic in or next to memory in order to improve performance and save energy, as shown in Fig. 1. While many applications might benefit from this methodology, we initially focus on two logic-in-memory modules: a "trigonometric interpolation memory" module and a "pointer chasing memory" module.

Figure 1. Smart memory: embedded logic in memory
2. Proposed Ideas

2.1 Experiment 1: Trigonometric Function Evaluation

Several computer applications, digital signal processing in particular, require the repeated evaluation of a few specific functions such as the trigonometric functions. Usually, function values are obtained by computing the terms of a series to the required degree of precision, but this may be unacceptably slow. Alternative solutions are either table lookups or special-purpose arithmetic units. The table-lookup approach, however, is useful only for limited-precision applications, since the table size grows exponentially with the argument precision, and special-purpose hardware designed for each required function may be expensive and inflexible.

In this study, we consider a general technique that combines the table-lookup (memory) and computational (arithmetic logic) approaches to function evaluation. The hardware realization of this technique will be called an "interpolating memory". It is in essence a memory unit that contains a subset of the required values of f(x), together with the means of interpolating additional function values on each read reference. The locality property of interpolation lends itself to pushing the computation into intelligent memory, so that the computation happens close to the data and only the final result is transferred to the CPU.

One possible application is to integrate this interpolating memory into the Data Pump Architecture (DPA), an ongoing project in the SPIRAL lab. The DPA is built on the SPARC architecture [5] and has an algorithm-specific, customizable on-chip memory organization. Our goal is to plug our "smart" memory into the on-chip memory organization of the DPA and exploit it for DSP-related tasks such as supplying twiddle factors for large-size FFTs.

2.2 Experiment 2: Pointer Chasing Memory

We will also pursue this idea further on other well-known applications that suffer from memory-latency-based performance problems.
One such important problem is the irregular distribution of Linked Data Structures (LDS) in memory: because the addressing is irregular, accesses exhibit no regularity, and conventional prefetching methods can even degrade performance. One possible solution for LDS is the "pointer chasing" method, which exploits the serial nature of pointer dereferences in the lower-level memory, independently of the CPU [6].

Within our project, alongside the trigonometric function evaluation with the interpolating memory, and in order to generalize the idea, we also propose to use a logic-in-memory implementation for pointer chasing on linked-list data structures. As our secondary objective, we aim to design a similar logic-in-memory structure that explicitly performs pointer chasing when a linked-list traversal is issued by the processor. We plan to use a search engine integrated into the memory to search for a particular key or pattern in the LDS [7]. In our scheme, the processor ships the template for the key/pattern search code to the search engine. The search engine, which is essentially the "logic" in the memory, then searches for the key by traversing the LDS and comparing the data to the key ahead of the processor [8]. In this way, we aim to eliminate the round-trip data movement among main memory, caches, and the CPU. An abstract depiction of our solution scheme is given in Figure 2.

Figure 2. Abstract depiction of the solution scheme

3. Research Plan

1st Milestone: For the interpolating memory, we will build a preliminary functional-level C implementation of memory modules enhanced with embedded interpolation logic. At the same time, we will explore the computer architectures into which the smart memory can be plugged. Regarding pointer chasing, at this stage we will study the literature in more depth and determine the architecture of the proposed search engine.
2nd Milestone: For the interpolating memory, we will translate the C program into Verilog; the design will be mapped onto a standard-cell library using a commercial synthesis tool or evaluated through FPGA prototyping, and its power, area, and performance will be estimated. Regarding the pointer chasing application, we will build a C/Verilog model of the proposed solution at this stage.

3rd Milestone: At this stage the interpolating memory will be plugged into one or several computer architectures, and the system-level performance and energy benefits over a CPU-only implementation will be demonstrated. Regarding pointer chasing, we will run architectural-level simulations and evaluate the performance improvements.

References

[1] P. M. Kogge, T. Sunaga, E. Retter. Combined DRAM and Logic Chip for Massively Parallel Systems. Sixteenth Conference on Advanced Research in VLSI, pp. 4-16, Mar. 1995.
[2] R. Fromm, S. Perissakis, N. Cardwell, et al. The Energy Efficiency of IRAM Architectures. The 24th Annual International Symposium on Computer Architecture, pp. 327-337, 1997.
[3] P. A. La Fratta, P. M. Kogge. Design Enhancements for In-Cache Computations. 2009.
[4] C. J. Hughes, S. V. Adve. Memory-Side Prefetching for Linked Data Structures. UIUC CS Technical Report, 2001.
[5] "LEON3 Processor", http://www.gaisler.com/.
[6] C.-L. Yang, A. R. Lebeck. Push vs. Pull: Data Movement for Linked Data Structures. In Proc. of the 2000 Intl. Conf. on Supercomputing, May 2000.
[7] Y. Solihin, J. Lee, J. Torrellas. Correlation Prefetching with a User-Level Memory Thread. IEEE Trans. Parallel Distrib. Syst., 14(6):563-580, 2003.
[8] E. Ebrahimi, O. Mutlu, Y. N. Patt. Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems. HPCA 2009.