The Case for Processing-in-Memory
Jay B. Brockman and Peter M. Kogge
Department of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN 46556
jbb@cse.nd.edu, kogge@cse.nd.edu
Abstract: The conventional approach to computer design has been to couple the fastest possible
processor with the cheapest, densest memory. Because of the widening gap, however, between the
bandwidth that processors demand and that memories can supply, such architectures are reaching
their limits. One approach to bridging this gap is moving the processor onto the memory chip. Not
only can processing-in-memory, or PIM, lead to improved performance but it may do so with significant reductions in complexity, area, and power. In this article, we first review trends in processor and memory performance, and show how much of the bandwidth currently available in
DRAMs is literally thrown away. Next, we discuss the architectural spectrum of PIM designs,
with examples of actual chips, and show how the memory’s “hidden bandwidth” can be reclaimed.
Finally, we present technology and design methodology challenges for implementing PIMs, and
conclude with future directions.
Keywords: architecture, bandwidth, memory, processing-in-memory

Acknowledgment: This work was supported in part by NASA grant NAG 5-2998, NSF grant ACS96-12028, JPL grant 960587, and support from the Special Purpose Processing Group of the Mayo Clinic. Submitted for review to IEEE Computer, January 10, 1997; copyright may be transferred without notice, after which this version may no longer be accessible.
1.0 Introduction
For the past several decades, the conventional approach to computer system development has
been to separately optimize the processor and memory components. Traditional wisdom has long
maintained that the greatest gains in computing performance result from coupling the fastest possible processor (superpipelined, superscalar, out-of-order execution, etc.) with the cheapest, densest memory through a multi-level memory hierarchy. Each side (logic and memory) was and is
designed with different techniques, using different tools, fabricated with different technologies,
and in general optimized for different goals. While clearly successful, such a bifurcated approach
is rapidly approaching a wall. Significant evidence exists that classical architectures are already
reaching their limits, with continued performance improvement coming at a very high cost in
terms of transistor count, area, and power.
The problem lies at the heart of this bifurcation - the coupling between the ever more sophisticated microprocessor chip and the memory chips. Clock rates for CPUs are increasing by 30% a
year, with overall bandwidth demands by the core CPUs growing even more rapidly. Multiple
execution units, 64 bit operands, and speculative look-ahead, for example, all increase the number
of bytes per cycle needed to keep the CPU fed. Furthermore, with all this sophistication comes a
growing reliance on not only raw bandwidth but also low latency access times to avoid performance-killing stalls.
On the other side, the spectacular technology improvements in DRAM have been limited largely
to density. Basic access time has improved only minimally over the last decade. The number of
pins available on memory parts for data transfers has not grown dramatically for cost reasons.
Actual transfer rates have been constrained by the processor’s memory bus, which is now 4 to 8 times slower than the internal CPU clock rate.
The result: a growing gap between the needs of a modern CPU core and the capabilities of the
memory system. That this effect is known to system designers is evidenced by the growing reliance on expensive multi-level caching schemes, and the introduction of new memory architectures that specifically try to improve bandwidth and latency in certain cases, such as Extended
Data Out DRAM (EDO), dual bank synchronous DRAM (SDRAM), video RAM (VRAM), or
RAMBUS (RDRAM).
Are these techniques working? The evidence is that they are not. Reports are growing that, for example, machines with the capability of executing 4 instructions per cycle are actually performing as poorly as 4 cycles per instruction on real code [3,12]! Analyses [1] of the causes indicate that the lion’s share is from this mismatch with the memory. If solutions are not found, then the
machines of tomorrow may use significantly more transistors to achieve only slightly better performance than we have today.
An alternative approach to system organization, one that promises to deliver high performance with the greatest silicon efficiency, is to place the processing logic on the same die as high density memory, especially DRAM. Such organizations have been termed Processing-In-Memory (PIM) [5] or Intelligent RAM (IRAM) [8]. By placing the actual processing logic on the memory chip,
and logically and physically matching it to the internal memory arrays, PIM is able to consume
the very high internal memory bandwidth available within the DRAM structure itself. Not only
does this provide for an extremely efficient coupling between the processor and memory, but it
also liberates chip contacts for more productive work, such as interfacing with sensors and I/O, or
communicating with other PIM chips. Through the three mechanisms of locality for markedly
reduced memory latency, wide data access for greatly increased memory bandwidth, and new
classes of architectures that provide natural parallelism, PIM avoids the performance barriers
imposed by the memory system that limit conventional architectures, and holds the promise of
achieving unmatched levels of performance and scalability, at reduced power and silicon area.
The greatest performance gains in PIM-based architectures will go to those designs that can best
leverage these three mechanisms. This paper states the case for considering such configurations.
First we set the stage by analyzing expected trends in conventional technology. Then we review
the nature of a modern DRAM: what bandwidth and latency capabilities are present, and how current architectures throw them away. Next, we briefly define the design space that is opening up for
PIM-based chips, with examples of real chips that have begun the exploration. Following this is a
discussion of the fabrication challenges that await us. Finally, we end with some projections of
where commodity PIM parts of the future may take us.
2.0 Supply and Demand
Fig. 1 gives two interesting projections of future microprocessor computer systems. Fig. 1(a) diagrams two curves: a projected clock rate for commodity CMOS microprocessors [11] and potential bandwidth demands of the core CPUs of such chips, assuming that the increasing CMOS logic capabilities are used for ever more aggressive multiple instruction execution, wider operands, vector operations, speculative look-ahead, and multi-threading. Fig. 1(b) diagrams a projected growth in per-chip memory density (again using SIA projections, but delayed slightly to reflect the commodity market) along with the total amount of memory expected in a commodity PC.
FIGURE 1. (a) Projected clock rate (MHz) and core CPU bandwidth needs (GB/s) of CMOS commodity microprocessors, 1995-2007. (b) Projected growth of commodity DRAM chip density and PC memory capacity, 1995-2007.
An interesting date in these charts is 2003. About this time the projected memory needs for a PC
can be satisfied by a single DRAM part. At this point if we assume our commodity microprocessor has a 400 MHz clock, a 10 way superscalar CPU with 5 load/store units, 64 bit data operands,
and is multithreaded, then the core CPU (before any caches are referenced) requires something on
the order of 40 GB/s. If we assume an on-chip two level cache (using 10s of millions of transistors) with level 1 instruction cache hits of 98% and data cache hits of 90%, and level 2 cache hits
of 75%, then about 50 million misses per second leave the level 2 cache. If level 2 cache lines are
on the order of 256 bytes, this translates into a bandwidth requirement of about 12 GB/sec from
our lone memory part.
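The arithmetic behind these round numbers can be reproduced in a few lines. The sketch below is ours: the per-instruction fetch traffic and the assumed 0.4 data references per instruction are illustrative choices made to land near the figures quoted above, not values taken from the SIA roadmap.

```python
# Rough reconstruction of the 2003 core-CPU and cache-miss bandwidth estimates.
# The per-instruction traffic and reference-rate assumptions are illustrative.
clock_hz = 400e6           # assumed 2003 commodity clock
issue_width = 10           # 10-way superscalar
load_store_units = 5
operand_bytes = 8          # 64-bit operands

instr_per_sec = clock_hz * issue_width
# Core demand: instruction fetch (assume 4 bytes/instruction) plus load/store traffic.
core_demand = instr_per_sec * 4 + clock_hz * load_store_units * operand_bytes
print(f"core CPU demand  ~ {core_demand / 1e9:.0f} GB/s")        # ~32, i.e. order of 40 GB/s

# Cache filtering, assuming 0.4 data references per instruction.
l1i_misses = instr_per_sec * (1 - 0.98)
l1d_misses = instr_per_sec * 0.4 * (1 - 0.90)
l2_misses = (l1i_misses + l1d_misses) * (1 - 0.75)
print(f"L2 misses        ~ {l2_misses / 1e6:.0f} M/s")           # ~60, i.e. order of 50 M/s

line_bytes = 256
print(f"memory bandwidth ~ {l2_misses * line_bytes / 1e9:.0f} GB/s")  # ~15, i.e. order of 12 GB/s
```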
Looking at a conventional memory part, the average data bandwidth it can provide may be estimated simply as:
(number of pins on data bus) × (bus clock rate) × (data bus utilization) / (average clock cycles per transfer)
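Where a quick check is wanted, the estimate can be evaluated directly. The function and the 64-pin, 100 MHz example below are hypothetical illustrations, not the parameters of any particular memory part.

```python
def avg_memory_bandwidth(data_pins, bus_clock_hz, utilization, cycles_per_transfer):
    """Average deliverable bandwidth (bytes/s) of a conventional memory part.

    data_pins: data bus width in bits (one bit per pin)
    bus_clock_hz: data bus clock rate in Hz
    utilization: fraction of bus cycles carrying useful data (0..1)
    cycles_per_transfer: average bus clock cycles per transfer
    """
    return (data_pins / 8) * bus_clock_hz * utilization / cycles_per_transfer

# Hypothetical example: a 64-pin data bus at 100 MHz, 50% utilization,
# one transfer per cycle, averages about 0.4 GB/s.
print(avg_memory_bandwidth(64, 100e6, 0.5, 1) / 1e9)   # -> 0.4
```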
Assuming that we want to keep our memory bus utilization down to about 50%, and that we have
applied aggressive techniques inside the DRAM to allow a transfer per cycle, then this means that
the product of the number of data pins and the data bus clock rate must be about 24 GB/s. This is
about 50 times greater than that of the highest-bandwidth current part, RAMBUS. If we assumed
today’s best PCI bus of about 66 MHz, this translates into about 400 data pins (not counting parity, ECC, addresses, etc.). Assuming the SIA projected maximum off-chip rates of 300 MHz
drops this to a mere 80 data pins. However, throwing in parity, addressing, control, power and the
like, our lone memory part takes on the packaging and cooling challenges of our microprocessor
chip. And we have yet to factor in connections for I/O, especially the demands of very high resolution displays and, what should by then be ubiquitous, multiprocessor memory coherency support. The net effect is a memory part that looks radically different from the cheap commodity parts of today. In summary, from a design point of view, we have expended a huge amount of complexity, transistors, packaging, and associated power on both sides of the interface to bridge
the gap.
3.0 Memory Organization and the “Hidden Bandwidth”
While the rate at which a memory system can supply data over conventional busses may be
hugely less than core CPU demands, internally DRAM data availability even today is much
greater. As shown in Fig. 2, a typical DRAM is organized as banks of memory “unit arrays,” connected by a network of address distribution, decoding, and selection logic [9]. A conventional
DRAM read operation is composed of two phases, row access (RAS) and column access (CAS).
During the RAS phase, a row address is asserted on the address bus and decoded, causing a row of
data within one or more of the unit arrays to be selected and written to a row buffer. During the
CAS phase, the selected row(s) of data is further multiplexed according to the decoded column
address, and driven off-chip as a word typically ranging from 1 to 9 bits in length. During this
operation, a vast amount of potential bandwidth is literally discarded in order to funnel the data
from the unit arrays onto the narrow off-chip data bus. Further, this “funnelling” process adds significantly to the latency of accessing random data. By characterizing the organization of the unit
arrays in the overall floorplan, we can determine the amount of “hidden bandwidth” available
within a DRAM.
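A toy model makes the funnelling concrete: a full row is sensed on every access, but only a narrow word is ever driven off chip. The array dimensions and word width below are placeholders, not those of a specific DRAM.

```python
# Toy model of a DRAM read: a whole row is sensed, but only a narrow word
# leaves the chip. Dimensions are illustrative placeholders.
ROWS, COLS, WORD_BITS = 512, 2048, 8

memory = [[0] * COLS for _ in range(ROWS)]   # memory[row][col] holds one bit

def dram_read(row_addr, col_addr):
    # RAS phase: the entire row (COLS bits) is sensed into the row buffer.
    row_buffer = memory[row_addr]
    # CAS phase: the column decoder selects a narrow word from the buffer;
    # the remaining COLS - WORD_BITS sensed bits are simply discarded.
    start = col_addr * WORD_BITS
    return row_buffer[start:start + WORD_BITS]

word = dram_read(3, 17)
print(len(word), "bits reach the pins out of", COLS, "bits sensed")
```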
FIGURE 2. Generic DRAM architecture: unit arrays of wordlines and bitlines with row decoders feed sense amplifiers and row buffers 256-4096 bits wide, which a multiplexor narrows to a 1-9 bit external data word.

Each DRAM unit array is arranged as a grid of wordlines (rows) and bitlines (columns), with memory cells at the intersections. A DRAM memory cell contains a capacitor that can store charge—where the presence or lack of charge indicates whether that cell holds a logical 1 or 0—as well as a pass transistor that connects the capacitor to the grid. When a wordline is selected
during a read operation, the pass transistors of each cell along that wordline are turned on, causing
the charge (if any) on the cell capacitors to be transferred to the bitlines. This transfer of charge in
turn causes a minute change in the bitline voltages. A sense amplifier at the end of each bitline
detects this change in voltage and amplifies it to logic levels. During a read operation, some of the
charge stored in a cell is lost. As a result, the sense amplifiers must also drive their results back
onto the bitlines to refresh the cells’ contents, and further, every bitline in the unit array must have
a unique sense amplifier to perform this refresh. In this manner, an entire row of data is read from
a unit array at a time. A write operation is similar, except that selected bitlines are forced to either
high or low voltages prior to refresh.
The size of a unit array is determined by a combination of technology constraints and design
trade-offs. A primary concern in DRAM design is the ratio between the cell capacitance and the
bitline capacitance, as charge sharing between the cell and the bitline limits the ability to sense the
charge stored in a cell. This effectively restricts the length of a bitline and hence the number of
wordlines in a unit array. The number of bitlines, on the other hand, is limited by the parasitic
resistance and capacitance of the wordline, which slows the wordline select time. Further, because
each bitline in a unit array requires a sense amplifier that is activated with each read, power dissipation is also a concern in determining the number of bitlines. Because of these considerations,
typical unit array sizes have not grown appreciably between memory generations. As examples, in
one IBM 16 Mbit DRAM [2], the unit array size was 128 Kbits, organized as 512 wordlines and 256 bitlines. In a 1 Gbit DRAM recently announced by Samsung [14], the unit array size is 256 Kbits,
organized as 512 wordlines and 512 bitlines.
Because all bitlines are sensed during a read operation, the bandwidth available from a unit array
is tremendous. To calculate this bandwidth, the number of cells per wordline is divided by the
delay from the time that a row address is strobed to the time that the sense amplifier output has
been latched in the row buffers. If the row delay for the 16 Mbit DRAM is conservatively estimated as 30 ns, the effective unit array bandwidth is 256 bits / 30 ns ≈ 1 GB/s. For the 1 Gbit
DRAM, where the time of an entire read cycle is 31 ns, the available bandwidth per unit array is
over 2 GB/s!
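These per-array figures follow directly from the row width and the row access time; a small helper makes the arithmetic explicit, using the conservative 30 ns and quoted 31 ns access times from above.

```python
def unit_array_bandwidth(bits_per_row, row_access_s):
    """Bandwidth (bytes/s) delivered at a unit array's row buffer."""
    return (bits_per_row / 8) / row_access_s

# 16 Mbit DRAM: 256-bit rows, ~30 ns row access  -> ~1 GB/s
print(unit_array_bandwidth(256, 30e-9) / 1e9)    # ~1.07
# 1 Gbit DRAM: 512-bit rows, 31 ns read cycle    -> ~2 GB/s
print(unit_array_bandwidth(512, 31e-9) / 1e9)    # ~2.06
```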
In order to determine the total available internal DRAM bandwidth per chip, the number and
placement of the unit arrays in the overall floorplan must be considered, as well as techniques for
routing data from the unit arrays to the interface circuitry (periphery). Fig. 3(a) shows a floorplan
similar to the IBM 16 Mbit DRAM. The entire chip contains 128 unit arrays, arranged in four
banks of 32 unit arrays, with two banks on either side of the centrally-located interface circuitry.
In this organization, data is routed to the interface circuitry via metal busses that run over the unit
arrays, with one data line multiplexed between 8 bitlines. The net result is that each of the four
banks presents 128 data lines to the interface. If each bank were turned on simultaneously, with a
row read time of 30 ns, the total internal bandwidth would be 2 GB/s.
In general, the higher the external bandwidth of a DRAM, the greater the amount of internal bandwidth that is wasted. In a 1996 ISSCC paper on their 1 Gbit synchronous DRAM, Mitsubishi
reports that only 16 K sense amplifiers are activated during each read [7]. Again assuming a very
conservative 30 ns access time, this amounts to an internal bandwidth of over 60 GB/s - enough
for our 2003 core CPU without caching. Although the external bandwidth of the chip is 0.2 GB/s
(through 64 data pins), this still represents less than 0.5 percent of the internal bandwidth at the
sense amplifiers.
If one went to the extreme and gave independent access to each unit array, then the peak bandwidth available in a 16 Mbit design (with 128 unit arrays) would be in excess of 128 GB/s, and a
part like the 1 Gbit DRAM, with 4,096 unit arrays, would have 8 TB/s (terabytes/s) - enough
bandwidth to support 200 of our 2003 state-of-the-art microprocessor cores, and this is with
today’s technology!
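The chip-level totals in this section are simply the per-array numbers scaled by how many sense amplifiers fire at once. The sketch below redoes that arithmetic with the same conservative 30 ns row access estimate; it is a rough check, not data from the cited parts.

```python
ROW_ACCESS_S = 30e-9   # same conservative row access estimate used above

def internal_bandwidth(active_bits):
    """Internal bandwidth (bytes/s) when active_bits sense amplifiers fire per access."""
    return (active_bits / 8) / ROW_ACCESS_S

# Mitsubishi 1 Gbit SDRAM: 16 K sense amplifiers per read -> ~68 GB/s internally,
# of which the 0.2 GB/s external interface uses well under 0.5 percent.
print(internal_bandwidth(16 * 1024) / 1e9)               # ~68
print(100 * 0.2e9 / internal_bandwidth(16 * 1024))       # ~0.29 (percent)

# Extreme case: every unit array active simultaneously.
print(128 * internal_bandwidth(256) / 1e9)    # 16 Mbit part, 128 arrays  -> >128 GB/s
print(4096 * internal_bandwidth(512) / 1e12)  # 1 Gbit part, 4096 arrays  -> on the order of 8 TB/s
```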
4.0 Reclaiming the “Hidden Bandwidth”: The PIM Architectural
Spectrum
From the memory bandwidth equation presented earlier, it is clear that there are three ways of improving the supply of data from the memory system to a processor: increasing the number of data pins on the memory, reducing the average access time from the memory, or improving
the utilization of the data bus. The key to PIM is the ability to reclaim the vast “hidden bandwidth” from within the memory through optimization of each of these three parameters. More
than simply embedding processing logic and memory on the same chip, this viewpoint affects the
design at all levels, from the instruction set, to logic, to the physical floorplan.
PIM as a technology clearly offers some stunning improvements in basic memory parameters—so
much so that it permits returning to basic computer design with a “clean sheet of paper” view that
is not distorted by “classical” approaches. Although only perhaps a dozen significant PIM chips
have been designed to date, several distinct architectural directions have emerged. These directions vary in their generality of function, how amenable they are to scaling to multi-chip systems
at enhanced performance levels, and how effective they are in using the bandwidth potential on
chip. They also vary in how sensitive they are to technology, such as at what level of memory or
logic density they first become feasible, what ratios of logic to memory area they require, and how
effective they are at extracting usable performance from the silicon area that they consume.
To date we have identified seven distinct architectural directions that one might take in using PIM
technology. Each is discussed briefly below. Fig. 3 outlines some generic chip floorplans.

FIGURE 3. PIM floorplans (not to scale): (a) Commodity DRAM; (b) Mitsubishi M32R/D; (c) Linden DAAM; (d) EXECUBE; (e) Shamrock.
Moving The Cache onto the Memory
One clear approach to reducing the bandwidth gap between the memory and the processor is
reducing the average access time from memory. Although there are limits to the time that it takes
to access an individual memory cell (imposed by the ability to sense charge in a cell), the standard
memory organization discussed in the previous section makes it possible to exploit data locality,
so that once a given address has been read, other addresses on the same row are latched into the
sense amplifiers and may be read very quickly. As a result, the average data access time decreases
for the typical addressing patterns that arise from the execution of computer programs. This natural “caching” ability within the memory is exploited in fast page mode or extended data out
(EDO) DRAMs. The idea is carried even further in the RAMBUS architecture, which has multiple internal cache lines to reduce the miss rate, as well as data encoding that improves pin utilization. To take this notion to the extreme, Saulsbury et al. of Sun Microsystems showed that by embedding a two-way set associative cache with only 32 lines that run the full width of the memory in a 256 Mbit DRAM, memory latencies for the SPEC ’95 benchmark suite could be so low that a 200 MHz MicroSparc-II with such a memory system could have overall performance comparable to a 300 MHz Alpha 21164 (DEC 8200 5/300) [10]. In terms of silicon efficiency, because
the sense amplifiers are already in place, integrating the cache with the DRAM offers tremendous
savings in transistor count, and hence power, over a conventional, separate cache.
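A small simulation suggests why a handful of row-wide lines at the sense amplifiers can be so effective: any access stream with modest spatial locality hits almost every time. The cache geometry and the address stream below are our own illustrative choices, not those of the Sun study.

```python
# Minimal sketch of a row-wide, two-way set-associative cache held at the
# sense amplifiers. Geometry is illustrative: 32 lines of 2 KB (one full
# row each), organized as 16 sets x 2 ways with LRU replacement.
LINE_BYTES, SETS, WAYS = 2048, 16, 2

cache = [[] for _ in range(SETS)]    # each set holds up to WAYS row tags, LRU order

def access(addr):
    """Return True on a hit; update LRU state either way."""
    row = addr // LINE_BYTES
    s = row % SETS
    ways = cache[s]
    hit = row in ways
    if hit:
        ways.remove(row)
    elif len(ways) == WAYS:
        ways.pop(0)                  # evict the least recently used row
    ways.append(row)
    return hit

# A mostly sequential stream hits on 31 of every 32 accesses once a row is open.
trace = range(0, 1 << 20, 64)
hits = sum(access(a) for a in trace)
print(f"hit rate {hits / len(trace):.1%}")    # ~96.9%
```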
Moving the storage and logic for the cache function onto the memory chip is a first step towards
exploiting the benefits of Processing-in-Memory. Such an approach, however, is still limited by
the very narrow data channel for moving data off-chip, which in turn is constrained by the number
of pins. Only by moving the entire processor onto the memory chip can the full performance
potential be realized.
Fixed Function Processing
Because of the wide data access to memory, PIM provides natural opportunities for parallelism,
giving rise to a range of architectural possibilities. These architectures may be characterized by
the efficiency with which they can consume these very wide data words, the flexibility of the
design, and the overall complexity. Perhaps the most straightforward application of PIM technology in this manner is in the design of accelerators that perform bit-level arithmetic operations at
the memory row buffers. The most obvious candidate applications involve graphics processing,
with one or more layers of the graphics pipeline integrated onto a memory part that is used for display buffering. Functions might include Z-buffering, bit-blitting, texturing, etc., and could serve
as accelerators in conventional systems. Several such chips are in production now. Other such
PIMs are finding their way into network interface buffers.
Because these functions can be implemented with comparatively small amounts of logic and typically require lower clock rates, such chips have the key advantage of stressing the semiconductor
technology the least. Further, the amount of memory needed for an entry level part is fairly low;
for example, for a display with a million pixels, graphics PIMs become feasible with a low number of megabytes on chip—certainly within range of many different technologies today.
There are, however, two major disadvantages to such architectures. First, a non-programmable
chip has a relatively small range of applicability. Second, within one system such chips are not
highly scalable; for example, in a graphics environment three simple PIMs—one per color—
would just about exhaust the back end of the graphics pipeline. Both of these disadvantages would
limit market volume, driving up the price that must be charged per chip to recover the design
efforts.
Single Chip Computing Systems
Another obvious direction for PIMs is the integration onto a single chip of a complete computing
system: memory, CPU, and major I/O subsystems. This is an extension of the single chip microcontrollers used for years, but with significantly more powerful CPUs and significantly larger
amounts of memory. There are several very significant advantages to single-chip system integration. First, a single chip type simplifies system design. Second, wide off-chip memory and cache
busses can be eliminated, reducing pin count and (often significantly) power consumption. Third,
single chip systems also reap the benefits of integrating the cache onto the memory as discussed
earlier. Finally, as technology advances, the migration of the chip to either a smaller die or one
with more memory is direct.
There are, however, potential problems with single-chip systems. The first set of problems arises
from technology limitations. Although total silicon area for a single-chip system is less than for a
multi-chip solution, it is all on one die, having a potentially adverse effect on yield. Further, if the
design targets both the densest available memory and a complex “state-of-the-art” CPU, it may
require the best features of memory and logic fabrication processing—discussed later in this
paper—which would greatly increase chip cost. Additional problems may result from scalability. If one wants to build coherent shared memory systems, then there must be sufficient off-chip contacts to carry the inter-node bandwidth, as well as on-chip logic to handle the memory coherency issues. Both of these drive up the price of a part without providing additional functionality
when used in the most probable singleton configuration.
Existing PIMs of this type run the gamut of architectural sophistication. In the early 90s, a version of Caltech’s MOSAIC chip integrated a small amount of planar DRAM onto the CPU chip. The more recent Analog Devices SHARC line of DSPs uses several megabits of fast SRAM for on-chip
memory connected in “classical” fashion to an on-chip CPU. This provides very high levels of
performance with a relatively simple design. The floorplan of the Mitsubishi M32R/D, shown in
Fig. 3(b), has 16 megabits of DRAM, augmented with a small first level cache abutting each of the
two memory arrays, and a simple, pipelined CPU in the center of the chip between the two halves
of cache. The chip also provides the equivalent of a memory bus to allow the on-chip CPU to
access off-chip memory. The Mitsubishi PIM demonstrates many of the performance advantages
predicted by Sun’s simulations, namely, that a small cache embedded in the memory can reduce
the memory bottleneck to the point that a simple CPU can achieve near peak performance, comparable to that of a high-speed microprocessor but without the complexity.
An interesting observation of the above examples is that while they do successfully integrate a
processor onto a memory chip, they still use conventional processor architectures. None of these
use large fractions of the potential peak memory bandwidth, and what they do use goes to filling cache lines, not to driving computational function units directly. As such, alternative architectures,
such as VLIW, may in fact match PIM potentials better than today’s designs.
SIMD in Memory
One way to use more of the bandwidth of PIMs is to place multiple programmable dataflows very
near the sense amps, and to operate such dataflows in a SIMD parallel fashion. Data is read into
the dataflows from multiple on-chip memory stacks simultaneously, operated on simultaneously,
and then written back simultaneously.
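The read-operate-write-back pattern can be written down in a few lines. The bank count, row width, and the operation applied below are illustrative placeholders, not a description of any particular SIMD PIM.

```python
# Sketch of SIMD processing at the row buffers: every bank senses a full row,
# each bank's dataflow applies the same operation, and all results are written
# back in lockstep. Bank count, row width, and the operation are illustrative.
BANKS, ROW_WORDS = 16, 256

banks = [[(b + i) % 251 for i in range(ROW_WORDS)] for b in range(BANKS)]

def simd_step(op):
    row_buffers = [list(bank) for bank in banks]               # all banks read at once
    results = [[op(x) for x in buf] for buf in row_buffers]    # same op in every dataflow
    for bank, result in zip(banks, results):                   # all banks write back
        bank[:] = result

simd_step(lambda x: (x * 3 + 1) % 256)    # one SIMD instruction across all banks
print(banks[0][:4], banks[15][:4])
```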
For appropriate applications, such an approach is both simple and enormously effective. In many
cases, an almost unlimited number of chips can be ganged together for even more performance.
Finally, in terms of the ratio of memory to logic, the entry point for building such devices is often
relatively low. In a real sense, the key chips that enabled such machines as the original MPP, the
DAP, the CM-2, and a variety of associative memory machines were all SIMD PIMs, albeit with a
very small amount of simple flip-flop memory per node. More recently, chips such as the Terasys
and the Linden DAAM have used denser SRAM or DRAM as the base memory technology, with
significantly more complex dataflows positioned very near the sense amps. The DAAM, outlined
in Fig. 3(c), in particular exhibits huge internal parallelism: on one 4 megabit DRAM it has 4,096 separate 8 bit ALUs, all capable of processing simultaneously.
There are two limitations of such PIMs. First, because they are SIMD, they are still limited in
their programmability. Second, although they are capable of huge peak performances when the
operands for each dataflow are in that dataflow’s memory bank, performance can plummet if operands for one dataflow must come from some distant memory bank, especially one that is off chip. Not only must chip contacts be made available to carry such traffic, but a
great deal of complexity may be expended in programming the transfers in an efficient manner.
SMP PIMs
The next category of PIMs attempts both to use more of the on-chip bandwidth and to reduce the
inter-dataflow communication problem by placing multiple complete processors in a shared-memory symmetric multiprocessor (SMP) configuration on-chip. Current designs of this type, such
as the Texas Instruments MVP, partition the on-chip memory into a small number of essentially
conventional looking memory modules, and then implement a cross-bar or equivalent network to
interconnect them to the processors. The MIT MAP project is similar, placing four VLIW-like
CPUs on a chip with 128KB of shared SRAM, although in this case the on-chip memory is treated
as cache for off chip DRAM.
Memory density is today the major constraint in the construction of such chips. For the most part,
current chips have used low density SRAM, with the result that for many real applications the on-chip processors are memory starved, and spend a great deal of time importing and exporting data
that will not all fit on chip.
While this design approach also can result in a very effective single chip design, it still runs into
complexity problems when more than one such chip is to be combined into a single system. The
reason is that the architectural mechanisms needed to tie multiple chips together (either I/O ports
or again some sort of shared memory bus with coherence support) are fundamentally different from those used between processors on-chip. This places a great burden on the programmer, who
must now know when to invoke which mechanism. Further, such programs are going to be very
fragile as technology advances. Increasing the number of processors on-chip, for example, may
totally destroy a program written for an earlier generation PIM.
Tiled PIMs
Another approach to parallel PIM designs takes a more holistic view. The silicon area is recognized up front as a two-dimensional surface onto which logic areas and memory areas can be
“tiled.” Several architectural goals govern the tiling arrangement. First, processor and memory
components are interleaved so that maximal memory bandwidth can be provided in the richest
possible manner to the maximum number of processing units. Second, the depth of each memory
bank can be adjusted to provide appropriate processing-to-storage ratios for a class of related
applications. Finally, the interconnections between memory and processor components should be
simple, uniform, high bandwidth, and scalable, within a single chip as well as between multiple
chips. As with SMP chips, the major constraint to the construction of such chips is memory density. Matching the inter-node interconnection to what comes naturally on the surface of a chip,
however, can both simplify silicon use and provide simpler programming models.
Perhaps the first step in the direction of tiled PIMs was the EXECUBE chip [4,13], shown in Fig.
3(d). In this design, two 32 KB DRAM memory unit arrays were teamed with a complete standalone 16 bit CPU, with eight of these combination nodes tiled over a single die. Each such node
included support for four DMA links. Three DMA links from each node were used to form a low
wiring complexity binary hypercube of computers on chip, with the remaining link attached to
off-chip contacts. Thus large combinations of chips could be constructed by simple interconnection of DMA links between neighboring nodes, with a single interconnection protocol. This also
formed the basis for the programming model.
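With eight nodes, three links per node are exactly what a binary 3-cube requires: each neighbor differs in one address bit. The numbering below is our own illustration of that property, not EXECUBE's actual wiring.

```python
# Neighbor addressing in an 8-node binary hypercube: flipping one of three
# address bits names the node at the other end of each on-chip DMA link.
# The node numbering is our illustration, not EXECUBE's actual wiring.
NODES, DIMS = 8, 3

def hypercube_neighbors(node):
    return [node ^ (1 << d) for d in range(DIMS)]

for node in range(NODES):
    print(node, "->", hypercube_neighbors(node))
# Each node uses three DMA links on chip; its fourth link goes to off-chip
# contacts, so larger systems reuse the same interconnection protocol.
```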
Because of the origins of the memory macros used in EXECUBE, only limited access from the
CPU’s instruction set to all of the sense amps was possible. Even so, under certain modes, several
gigabytes per second of on-chip memory bandwidth could be profitably consumed. Our current
work is carrying this design direction even further with the Shamrock PIM architecture [6]. In the
Shamrock floorplan, shown in Fig. 3(e), bands of processors and memory banks alternate in
stripes across the chip, with the sense amplifiers of the memory facing the processors. Identical
processing nodes will employ a combination of SIMD dataflows and a VLIW-like CPU.
A key innovation in Shamrock is to provide access to the memory from either end of the banks.
This allows processors on either side the best possible interconnection mechanism—a true shared
memory link. By offsetting the positions of the processing nodes on different rows, a topology
similar to a 2-d mesh is formed, where each node has true shared memory access to four other
nodes. Appropriate floorplanning then allows interface logic on the periphery of the chip to connect to memory banks on other chips, again with a single uniform programming model.
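One way to picture the offset arrangement is as a brick wall: each node touches two banks above and two below, and through those shared banks reaches four neighboring nodes. The coordinate scheme below is purely our illustration; the actual Shamrock layout may differ.

```python
# Brick-wall picture of the offset Shamrock-style layout: node (r, c) touches
# two memory banks above and two below, and through those shared banks reaches
# four neighbors. Coordinates are purely illustrative.
def offset_mesh_neighbors(r, c):
    shift = 0 if r % 2 == 0 else 1    # alternate rows are offset by half a node
    return [(r - 1, c - 1 + shift), (r - 1, c + shift),
            (r + 1, c - 1 + shift), (r + 1, c + shift)]

print(offset_mesh_neighbors(2, 3))    # -> [(1, 2), (1, 3), (3, 2), (3, 3)]
```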
Reconfigurable PIM
As a low-cost, quick-turnaround approach to PIM development, we have proposed the idea of a
reconfigurable PIM gate-array. Sized at approximately 32 Mbits of high-density DRAM and 250
K logic gates, such a chip could accommodate a wide variety of applications, such as accelerators
for image and signal processing. The floorplan for a reconfigurable PIM chip would contain multiple memory banks and fixed-point dataflows pitch-matched to the memory arrays. The chip
would also contain regions of sea-of-gates or gate-array logic for implementing other logic, control, and interface functions.
5.0 Challenges in Implementation
For the past several decades, the conventional approach to computer system development has
been to separately optimize the memory and processor components. This has led not only to a performance gap but also to a technology gap: processor and memory chips today are manufactured
on fundamentally different fabrication processes and designed using fundamentally different
methodologies.
Logic and DRAM fabrication processes have significant differences in terms of both transistor
and interconnect technology. Logic processes are designed to support both the fastest possible
switching speeds as well as compact routing between modules. This has led to complex processes
with many expensive steps. To reduce the layout area, four or more layers of metal interconnect
are common, which requires wafer surface planarization before the deposition of each metal layer.
Transistor device characteristics are also carefully optimized to minimize delay: for example, the
polysilicon gates of PMOS and NMOS transistors typically are doped separately to compensate
for work function differences.
In the evolution of DRAM fabrication processes, the primary goals have been minimizing the area
and maximizing charge retention of the memory cells, while keeping the overall manufacturing
cost as low as possible. As a result, efforts have focused chiefly on optimizing the design of the
memory cell, with minimal emphasis on the transistors and interconnect in the periphery. Today’s
DRAM cells have highly complex geometries that use either trench capacitors that extend deep
into the bulk of the silicon, or stacked capacitors that have fin-like projections layered above the
silicon surface. Both of these structures have far superior capacitance per unit area than planar
capacitors that could be implemented with pure logic fabrication processes. Peripheral logic transistors in DRAMs, however, typically do not have the optimizations for speed that would be found
in custom logic processes, as this would increase cost without significantly improving overall performance. As a result, DRAM periphery transistors are approximately half as fast as optimized
logic transistors. Further, DRAM fabrication processes typically have only one or two layers of
metal interconnect, as compared to four or more for logic processes.
An integrated PIM technology requires the densest possible memory. It would also require reasonably fast transistors and sufficient layers of interconnect, yet as performance is gained through
architecture, these may not have to be optimized to the degree commonly found in high-performance logic processes. In time, such process enhancements may be made, but in general a PIM fabrication process would be based upon a commodity DRAM process. In order to assess the manufacturability and economic feasibility of a PIM product, new yield models would have to be
developed that account for defects in either the logic or the memory array, and that consider
redundancy or error correction.
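As one illustration of what such a model might look like, the sketch below combines a classical Poisson yield term for the (unrepairable) logic with an inflated effective yield for the repairable memory array. The areas, defect density, and repair factor are invented placeholders.

```python
import math

def poisson_yield(area_cm2, defects_per_cm2):
    """Classical Poisson yield model: probability of zero fatal defects."""
    return math.exp(-area_cm2 * defects_per_cm2)

def pim_yield(logic_area, mem_area, d0, repair_fraction=0.8):
    """Toy combined yield: unrepairable logic times repairable memory."""
    y_logic = poisson_yield(logic_area, d0)
    y_mem_raw = poisson_yield(mem_area, d0)
    # Redundancy recovers a fraction of otherwise-fatal memory defects.
    y_mem = y_mem_raw + repair_fraction * (1 - y_mem_raw)
    return y_logic * y_mem

# Invented placeholder numbers: 0.3 cm^2 of logic, 1.0 cm^2 of DRAM, D0 = 0.5/cm^2.
print(pim_yield(logic_area=0.3, mem_area=1.0, d0=0.5))   # ~0.79
```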
As a result of the differences in function and technology, logic and DRAM design methodology
has also evolved along very different lines. In logic design, some of the main challenges are at the
highest levels of abstraction: determining a component structure that describes very complex
behavior, and then verifying that this structure is correct. This has led to a library-based approach
to design, using high-level synthesis and simulation CAD tools, as well as highly-automated component placement and routing. As much as is possible, detailed design and analysis at the transistor level is abstracted away from the logic design process.
Because memories are largely analog designs, concerns at the transistor level are paramount.
Electrical characteristics of the cells are determined through detailed 2-d device simulations.
Extensive Spice simulation is required to adjust the D.C. and timing performance of the sense
amplifiers and other peripheral circuitry. Circuit layouts are largely hand-crafted in order to minimize area, as well as to match the strict spacing requirements imposed by the cell array.
The challenge in developing a PIM design methodology goes beyond merely reconciling the differing practices and CAD tool suites currently used for logic and memory design. In order to be
effective, the methodology must readily grant logic designers access to the full internal bandwidth
of the memory, yet shield them from the detailed analog behavior. In order to use the highest
memory densities, the PIM design methodology must be synchronized with commodity DRAM
technology cycles, with 18 months development and 18 months sustained processing. Further,
because of the highly specialized nature of DRAM processing, a PIM design methodology must
be “open,” so that design houses can interact readily with the foundry.
The key to an open design methodology for PIMs is to base it upon the use of memory macros
that provide a “black box” view of the memory for the logic designer. Given such memory macros, a library-based, ASIC-like design flow for PIMs would be possible. Such a memory macro
would be designed to encapsulate a unit array from a high-density commodity DRAM, and would
serve two purposes. First, it would hide proprietary and unnecessary details from the logic
designer, such as cell design, internal timing, and redundancy. Second, it would provide architectural enhancements such as an extremely high bandwidth digital interface that would latch data
just after the sense amplifiers. To work with the memory macro, logic libraries would be developed that are pitch-matched to the spacing constraints imposed by the cell array. In order to allow
new PIM development to keep pace with the rapidly evolving technology cycles, new CAD tools
would have to be developed, including DRAM macro generators and pitch-matching datapath
synthesizers.
Power management is clearly another issue in PIM design. In traditional commodity memory
parts, where the goal is to produce a small, select amount of data at the pins, efforts are made to
activate as few sense amplifiers as possible during accesses in order to avoid large power spikes.
To produce high internal memory bandwidth for PIM applications, however, as many sense amplifiers as is feasible should be turned on in a single cycle.
A final area of concern for PIM designs is testing. In general, test methodologies for logic and for
memory are very different. Logic testing entails running long sequences of input vectors that collectively try to sensitize any defects in the logic paths. In memory testing, multiple complex
address and data patterns are run to uncover any defects that could result from actual code execution. Because, in a sense, logic testing requires hardware that emulates memory and memory testing requires hardware that emulates a processor, the potential exists for highly efficient built-in self-testing for PIMs. Such an approach was explored with EXECUBE, where a test program was initially loaded into memory to verify the basic functionality of a processor, which could then perform more exhaustive testing of the memory by local program execution.
6.0 Conclusions and Future Projections
Clearly, the current trend of improving processor performance through exponential increases in
architectural complexity cannot be sustained indefinitely. By reducing memory latency, providing
wide data access for greatly increased memory bandwidth, and opening new opportunities for
architectures with natural parallelism, PIM avoids the processor/memory bottleneck that plagues
performance in today’s systems. This, and the fact that PIM offers unmatched levels of silicon
efficiency and scalability, leads us to conclude that PIM is inevitably the next major evolution in
computer architecture.
What might commodity PIMs look like? Because the processing logic is placed in the memory,
with PIM there may no longer be a need for external, passive memory in the conventional sense.
Instead, systems may be scaled by adding more PIM modules, onto the same chip or onto different chips. Ultimately, one may envision large amounts of PIM replacing conventional memory in
the form of a “PIM-SIMM” where applications can either run in a conventional “host” microprocessor, or in the memory next to the data. This vision is the one we believe to offer the most direct
and pervasive application of PIM, and one that at the same time utilizes all its capabilities in a reasonable programming model.
7.0 Bibliography
1. M. Bekerman and Avi Mendelson, “A Performance Analysis of Pentium Processor Systems,”
IEEE Micro, Oct. 1995, pp. 72-83.
2. W. Ellis, et al., “Multipurpose DRAM architecture for optimal power, performance, and
product flexibility,” IBM J. of Research and Development, Jan./March 1995, pp. 51-62.
3. B. Joy, “Microprocessor Architecture: The Next 10 Years,” keynote address, HOT Chips VIII,
Stanford, CA, August 18-20, 1996.
4. P. Kogge, “EXECUBE - A new architecture for scalable MPPs,” Proc. Int. Conf. on Parallel
Processing, St. Charles, IL, Aug. 1994, pp. 77-84.
5. P. Kogge, et al., “Combined DRAM and logic chip for massively parallel systems,” Proc. 16th
Conf. on Adv. Research in VLSI, Raleigh, NC, Mar. 1995, pp. 4-13.
6. P. Kogge, S. Bass, J. Brockman, D. Chen, E. Sha, “Pursuing a Petaflop: Point Designs for 100
TF Computers Using PIM Technologies,” Proc. Frontiers of Massively Parallel Computation,
Annapolis, MD, Oct., 1996, pp. 88-97.
7. Y. Nitta, et al., “A 1.6 GB/s Data-Rate 1 Gb Synchronous DRAM with Hierarchical Square-Shaped Memory Block and Distributed Bank Architecture,” IEEE Solid State Circuits Conf.,
San Francisco, CA, Feb. 1996, pp. 376-377.
8. D. Patterson, T. Anderson, and K. Yelick, “A Case for Intelligent DRAM: IRAM,” HOT Chips
VIII, Stanford, CA, August 18-20, 1996, pp. 75-94.
9. B. Prince, Semiconductor Memories, second edition, Wiley, 1991.
10. A. Saulsbury, F. Pong, and A. Nowatzyk, “Missing the Memory Wall: The Case for Processor/Memory Integration,” Proc. of 23rd ISCA, June 1996.
11. Semiconductor Industry Association, The National Technology Roadmap for Semiconductors,
SIA, 4300 Stevens Creek Blvd, Suite 271, San Jose, CA 95129, 1994.
12. R. Sites, “It's the Memory, Stupid,” Microprocessor Report, Vol. 10, No. 10, August 5, 1996,
pp. 18-24.
13. T. Sunaga, et al., “A Parallel Processing Chip with Embedded DRAM Macros,” IEEE J. of
Solid-State Circuits, Vol. 31, No. 10, Oct. 1996, pp. 1556-1559.
14. J. Yoo, et al., “A 32-Bank 1 Gb DRAM with 1 GB/s Bandwidth,” IEEE Solid State Circuits
Conf., San Francisco, CA, Feb. 1996, pp. 378-379.