Application Specific Logic in Memory
Qiuling Zhu, Berkin Akin, Huseyin Ekin Sumbul
1. Motivation and Project Definition
Memory hierarchies, data prefetching, and out-of-order execution are all popular techniques for tolerating long
memory access latencies in existing processors, especially for applications that fit nicely into the caches or
exhibit enough locality for the caches to capture. However, these platforms become bottlenecks for data/memory-intensive
applications or applications with irregular data access patterns. For example, digital signal processing (DSP) and
scientific computation often fail to meet CPU-speed-based expectations by a wide margin, and prefetchers are
generally unable to predict access patterns that follow pointers in linked data structures. Moreover, the CPU-centric
design philosophy has led to very complex superscalar processors with deep pipelines and a large amount
of support logic. Recent VLSI technology trends offer a promising solution for bridging the processor-memory gap:
integrating processor logic and memory on a single chip. There have been significant recent advances in
mixing memory and logic more closely than in the traditional CPU-memory dichotomy, such as the PIM architecture [1], the IRAM
architecture [2], In-Cache Computations (ICCs) [3], and memory-side prefetching [4], all of which seek to improve
performance and lower power consumption by exploiting simple logic embedded in memory.
We propose to create “smart” memories: embedded logic-in-memory designs that place significant
amounts of logic in or next to memory and can thereby enable performance improvements and energy savings, as shown in
Fig. 1. While many applications might benefit from this methodology, we are initially focusing on two “logic in
memory” modules: a “trigonometric interpolation memory” module and a “pointer chasing memory” module.
Figure 1. Smart memory—Embedded Logic in Memory
2. Proposed Ideas
2.1 Experiment 1 - Trigonometric Function Evaluation
Several computer applications, digital signal processing in particular, require the repeated evaluation of a few
specific functions, such as trigonometric functions. Usually, function values are obtained by computing
the terms of a series to the required degree of precision, but this may be unacceptably slow. Alternative solutions
are either table lookups or special-purpose arithmetic units. The table-lookup approach, however, is useful only for
limited-precision applications, since the table size grows exponentially with the argument precision, and special-purpose
hardware designed for each required function may be expensive and inflexible. In this study, we consider
a general technique that combines the table-lookup (memory) and computational (arithmetic logic) approaches to
function evaluation. The hardware realization of this technique will be called an “interpolating memory”. It is in
essence a memory unit that contains a subset of the required values of f(x) together with the means of
interpolating additional function values on each read reference. The locality property of interpolation lends itself
to pushing the computation into intelligent memory, so that computation happens close to the data and only the
final result is transferred to the CPU.
One possible application is to integrate this interpolating memory into the Data Pump
Architecture (DPA), an ongoing project in the SPIRAL lab. The DPA is built around a SPARC architecture [5] and has
an algorithm-specific, customizable on-chip memory organization. Our goal is to plug our “smart” memory into the on-chip
memory organization of the DPA and exploit it for DSP-related tasks such as supplying twiddle factors for large-size
FFTs.
2.2 Experiment 2 - Pointer Chasing Memory
We will also pursue this idea further in other well-known applications that suffer from
memory-latency-bound performance problems. One important case is the irregular distribution of
Linked Data Structures (LDS) in memory: because the addressing is irregular, the access stream exhibits no
regularity, and conventional prefetching methods can even yield slowdowns. One possible solution to this problem
is the “pointer chasing” method, which exploits the serial nature of pointer dereferences in the lower levels of the
memory hierarchy, independently of the CPU [6].
Within our project, alongside the trigonometric function evaluation using the interpolating memory, and for the sake
of generalizing the idea, we also propose to use our logic-in-memory implementation for pointer chasing on
linked-list data structures. As our secondary objective, we aim to design a similar logic-in-memory structure that
explicitly performs pointer chasing when a linked-list traversal is issued by the processor. We plan to
use a search engine integrated into the memory to search for a particular key/pattern in the LDS [7]. In our scheme,
the processor ships the template for the key/pattern search code to the search engine. The search engine, which
is essentially the “logic” in the memory, then searches for the key by traversing the LDS and comparing the data to
the key ahead of the processor [8]. In this way, we aim to eliminate the round-trip data movement among main
memory, caches, and CPU. An abstract depiction of our scheme is given in Figure 2.
Figure 2. Abstract depiction of the solution scheme
3. Research Plan
1st Milestone:
For the “interpolating memory”, we will develop a preliminary functional-level C implementation of memory modules
enhanced with embedded interpolation logic. At the same time, we will explore the
computer architectures into which the smart memory can be plugged. Regarding pointer chasing, at this stage we
will study the related work in detail and determine the architecture of the proposed search engine.
2nd Milestone:
At this stage, for the “interpolating memory”, we will translate the C program into Verilog; the design will be
mapped onto a standard-cell library using a commercial synthesis tool or evaluated through FPGA prototyping, and its
power, area, and performance will be estimated. Regarding the pointer chasing application, we will build a
C/Verilog model of the proposed solution.
3rd Milestone:
At this stage, the interpolating memory will be plugged into one or several computer architectures, and its
performance/energy benefits will be demonstrated in comparison with a CPU-only implementation. Regarding
pointer chasing, we will run architectural-level simulations and evaluate the performance improvements.
References
[1] Peter M. Kogge, Toshio Sunaga, Eric Retter. Combined DRAM and Logic Chip for Massively Parallel Systems.
Sixteenth Conference on Advanced Research in VLSI, pp. 4-16, Mar. 1995.
[2] R. Fromm, S. Perissakis, N. Cardwell, et al. The Energy Efficiency of IRAM Architectures. 24th Annual
International Symposium on Computer Architecture, pp. 327-337, 1997.
[3] Patrick A. La Fratta, Peter M. Kogge. Design Enhancements for In-Cache Computations. 2009.
[4] C. J. Hughes, S. Adve. Memory-Side Prefetching for Linked Data Structures. UIUC CS Technical Report, 2001.
[5] “LEON3 Processor”, http://www.gaisler.com/.
[6] C.-L. Yang, A. R. Lebeck. Push vs. Pull: Data Movement for Linked Data Structures. In Proc. of the 2000 Intl. Conf.
on Supercomputing, May 2000.
[7] Yan Solihin, Jaejin Lee, Josep Torrellas. Correlation Prefetching with a User-Level Memory Thread. IEEE Trans.
Parallel Distrib. Syst. 14(6): 563-580, 2003.
[8] Eiman Ebrahimi, Onur Mutlu, Yale N. Patt. Techniques for Bandwidth-Efficient Prefetching of Linked Data
Structures in Hybrid Prefetching Systems. HPCA 2009.