Submitted for review, IEEE Computer, January 10, 1997
The Case for Processing-in-Memory
Jay B. Brockman and Peter M. Kogge
Department of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN 46556
jbb@cse.nd.edu, kogge@cse.nd.edu
Abstract: The conventional approach to computer design has been to couple the fastest possible
processor with the cheapest, densest memory. Because of the widening gap, however, between the
bandwidth that processors demand and that memories can supply, such architectures are reaching
their limits. One approach to bridging this gap is moving the processor onto the memory chip. Not
only can processing-in-memory, or PIM, lead to improved performance but it may do so with significant reductions in complexity, area, and power. In this article, we first review trends in processor and memory performance, and show how much of the bandwidth currently available in
DRAMs is literally thrown away. Next, we discuss the architectural spectrum of PIM designs,
with examples of actual chips, and show how the memory's "hidden bandwidth" can be reclaimed.
Finally, we present technology and design methodology challenges for implementing PIMs, and
conclude with future directions.
Keywords: architecture, bandwidth, memory, processing-in-memory
This work was supported in part by NASA grant NAG 5-2998, NSF grant ACS96-12028, JPL grant 960587, and support from the Special Purpose Processing Group of the Mayo Clinic. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
1.0 Introduction
For the past several decades, the conventional approach to computer system development has been to separately optimize the processor and memory components. Traditional wisdom has long maintained that the greatest gains in computing performance result from coupling the fastest possible processor (superpipelined, superscalar, out-of-order execution, etc.) with the cheapest, densest memory through a multi-level memory hierarchy. Each side (logic and memory) was, and still is, designed with different techniques, using different tools, fabricated with different technologies, and in general optimized for different goals. While clearly successful, such a bifurcated approach is rapidly approaching a wall. Significant evidence exists that classical architectures are already reaching their limits, with continued performance improvement coming at a very high cost in terms of transistor count, area, and power.
The problem lies at the heart of this bifurcation - the coupling between the ever more sophisticated microprocessor chip and the memory chips. Clock rates for CPUs are increasing by 30% a
year, with overall bandwidth demands by the core CPUs growing even more rapidly. Multiple
execution units, 64 bit operands, and speculative look-ahead, for example, all increase the number
of bytes per cycle needed to keep the CPU fed. Furthermore, with all this sophistication comes a
growing reliance on not only raw bandwidth but also low latency access times to avoid performance-killing stalls.
On the other side, the spectacular technology improvements in DRAM have been limited largely
to density. Basic access time has improved only minimally over the last decade. The number of
pins available on memory parts for data transfers has not grown dramatically for cost reasons.
Actual transfer rates have been constrained by the processor’s memory bus, and that is now 4 to 8
times slower than the internal CPU clock rate.
The result: a growing gap between the needs of a modern CPU core and the capabilities of the
memory system. That this effect is known to system designers is evidenced by the growing reliance on expensive multi-level caching schemes, and the introduction of new memory architectures that specifically try to improve bandwidth and latency in certain cases, such as Extended
Data Out DRAM (EDO), dual bank synchronous DRAM (SDRAM), video RAM (VRAM), or
RAMBUS (RDRAM).
Are these techniques working? The evidence is that they are not. Reports are growing that, for example, machines capable of executing 4 instructions per cycle are actually performing as poorly as 4 cycles per instruction on real code!3,12 Analyses1 of the causes indicate that the lion's share of the problem comes from this mismatch with the memory. If solutions are not found, then the
machines of tomorrow may use significantly more transistors to achieve only slightly better performance than we have today.
An alternative approach to system organization, one that promises to deliver high performance with the greatest silicon efficiency, is to place the processing logic on the same die as high-density memory, especially DRAM. Such organizations have been termed Processing-In-Memory
(PIM)5, or Intelligent RAM (IRAM)8. By placing the actual processing logic on the memory chip,
and logically and physically matching it to the internal memory arrays, PIM is able to consume
the very high internal memory bandwidth available within the DRAM structure itself. Not only
does this provide for an extremely efficient coupling between the processor and memory, but it
also liberates chip contacts for more productive work, such as interfacing with sensors and I/O, or
communicating with other PIM chips. Through the three mechanisms of locality for markedly
reduced memory latency, wide data access for greatly increased memory bandwidth, and new
classes of architectures that provide natural parallelism, PIM avoids the performance barriers
imposed by the memory system that limit conventional architectures, and holds the promise of
achieving unmatched levels of performance and scalability, at reduced power and silicon area.
The greatest performance gains in PIM-based architectures will go to those designs that can best
leverage these three mechanisms. This paper states the case for considering such configurations.
First we set the stage by analyzing expected trends in conventional technology. Then we review
the nature of a modern DRAM: what bandwidth and latency capabilities are present, and how current architectures throw them away. Next, we briefly define the design space that is opening up for
PIM-based chips, with examples of real chips that have begun the exploration. Following this is a
discussion of the fabrication challenges that await us. Finally, we end with some projections of
where commodity PIM parts of the future may take us.
2.0 Supply and Demand
Fig. 1 gives two interesting projections of future microprocessor-based computer systems. Fig. 1(a) diagrams two curves: a projected clock rate for commodity CMOS microprocessors11 and the potential
bandwidth demands of the core CPUs of such chips, assuming that the increasing CMOS logic
capabilities are used for ever more aggressive multiple instruction execution, wider operands, vec-
tor operations, speculative look-ahead, and multi-threading. Fig. 1(b) diagrams a projected growth in per-chip memory density (again using SIA projections, but delayed slightly to reflect the commodity market) along with the total amount of memory expected in a commodity PC.
FIGURE 1. (a) Projected clock rate (MHz) and core CPU bandwidth needs (GB/s) of commodity CMOS microprocessors, 1995-2007. (b) Projected growth of commodity DRAM chip density and PC memory capacity, 1995-2007.
An interesting date in these charts is 2003. About this time the projected memory needs for a PC
can be satisfied by a single DRAM part. At this point, if we assume our commodity microprocessor has a 400 MHz clock, a 10-way superscalar CPU with 5 load/store units, 64 bit data operands, and is multithreaded, then the core CPU (before any caches are referenced) requires something on the order of 40 GB/s. If we assume an on-chip two-level cache (using tens of millions of transistors) with a level 1 instruction cache hit rate of 98%, a level 1 data cache hit rate of 90%, and a level 2 hit rate of 75%, then about 50 million misses per second leave the level 2 cache. If level 2 cache lines are on the order of 256 bytes, this translates into a bandwidth requirement of about 12 GB/s from
our lone memory part.
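
To make the arithmetic explicit, the following sketch (in Python) reruns this estimate; the per-instruction and per-operand byte counts are illustrative assumptions, while the miss rate and line size are the figures quoted above.

# Back-of-envelope check of the 2003 scenario described above.
clock_hz        = 400e6   # 400 MHz commodity CPU clock
instr_per_cycle = 10      # 10-way superscalar issue
loads_stores    = 5       # load/store units
instr_bytes     = 4       # assumed instruction width (illustrative)
operand_bytes   = 8       # 64 bit data operands

core_demand = clock_hz * (instr_per_cycle * instr_bytes + loads_stores * operand_bytes)
print(core_demand / 1e9)  # ~32 GB/s, the same order as the ~40 GB/s above

# Off-chip traffic implied by the stated level 2 miss rate and line size.
l2_misses_per_s = 50e6    # misses per second leaving the level 2 cache
l2_line_bytes   = 256     # level 2 cache line size
print(l2_misses_per_s * l2_line_bytes / 1e9)  # ~12.8 GB/s, "about 12 GB/s"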
Looking at a conventional memory part, the average data bandwidth it can provide may be estimated simply as:
(average bandwidth) = (number of pins on data bus) × (bus clock rate) × (data bus utilization) / (average clock cycles per transfer)
Assuming that we want to keep our memory bus utilization down to about 50%, and that we have
applied aggressive techniques inside the DRAM to allow a transfer per cycle, then this means that
the product of the number of data pins and the data bus clock rate must be about 24 GB/s. This is
about 50 times greater than the highest-bandwidth current part, RAMBUS. If we assumed today's best PCI bus of about 66 MHz, this translates into about 400 data pins (not counting parity, ECC, addresses, etc.). Assuming the SIA projected maximum off-chip rate of 300 MHz drops this to a mere 80 data pins. However, throwing in parity, addressing, control, power, and the like, our lone memory part takes on the packaging and cooling challenges of our microprocessor chip. And we have yet to factor in connections for I/O, especially the demands of very high-resolution displays and, what should by then be ubiquitous, multiple-processor memory coherency support. The net effect is a memory part that looks radically different from the cheap commodity
parts of today. In summary, from a design point of view, we have expended a huge amount of
complexity, transistors, packaging, and associated power on both sides of the interface to bridge
the gap.
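
The expression above is easy to turn into a small calculator. The sketch below (Python) implements it and then solves for the raw pin-bandwidth product; the 64-pin, 66 MHz example part is hypothetical, while the 12 GB/s target, 50% utilization, and one transfer per cycle are the assumptions stated in the text.

# Average deliverable bandwidth of a memory part, per the formula above.
def memory_bandwidth(data_pins, bus_clock_hz, utilization, cycles_per_transfer):
    # Each data pin carries one bit per transfer; divide by 8 for bytes/s.
    return data_pins * bus_clock_hz * utilization / cycles_per_transfer / 8

# A hypothetical conventional part: 64 data pins on a 66 MHz bus, 50% utilization.
print(memory_bandwidth(64, 66e6, 0.5, 1) / 1e9)  # ~0.26 GB/s

# Raw pins-times-clock product needed to deliver 12 GB/s at 50% utilization
# and one transfer per cycle: 12 GB/s / 0.5 = 24 GB/s, as stated above.
print(12e9 / 0.5 / 1e9)  # 24 GB/s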
3.0 Memory Organization and the “Hidden Bandwidth”
While the rate at which a memory system can supply data over conventional busses may be far less than core CPU demands, the bandwidth available internally within a DRAM, even today, is much
greater. As shown in Fig. 2, a typical DRAM is organized as banks of memory “unit arrays,” connected by a network of address distribution, decoding, and selection logic9. A conventional
DRAM read operation is composed of two phases, row access (RAS) and column access (CAS).
During the RAS phase, a row address is asserted on the address bus and decoded, causing a row of
data within one or more of the unit arrays to be selected and written to a row buffer. During the
CAS phase, the selected row(s) of data is further multiplexed according to the decoded column
address, and driven off-chip as a word typically ranging from 1 to 9 bits in length. During this
operation, a vast amount of potential bandwidth is literally discarded in order to funnel the data
from the unit arrays onto the narrow off-chip data bus. Further, this “funnelling” process adds significantly to the latency of accessing random data. By characterizing the organization of the unit
arrays in the overall floorplan, we can determine the amount of “hidden bandwidth” available
within a DRAM.
Each DRAM unit array is arranged as a grid of wordlines (rows) and bitlines (columns), with
memory cells at the intersections. A DRAM memory cell contains a capacitor that can store
charge—where the presence or lack of charge indicates whether that cell holds a logical 1 or 0—
FIGURE 2. Generic DRAM architecture: banks of unit arrays (wordlines and bitlines) with decode logic, sense amplifiers, and 256-4096 bit row buffers, multiplexed down to a 1-9 bit data path.
as well as a pass transistor that connects the capacitor to the grid. When a wordline is selected
during a read operation, the pass transistors of each cell along that wordline are turned on, causing
the charge (if any) on the cell capacitors to be transferred to the bitlines. This transfer of charge in
turn causes a minute change in the bitline voltages. A sense amplifier at the end of each bitline
detects this change in voltage and amplifies it to logic levels. During a read operation, some of the
charge stored in a cell is lost. As a result, the sense amplifiers must also drive their results back
onto the bitlines to refresh the cells’ contents, and further, every bitline in the unit array must have
a unique sense amplifier to perform this refresh. In this manner, an entire row of data is read from
a unit array at a time. A write operation is similar, except that selected bitlines are forced to either
high or low voltages prior to refresh.
The size of a unit array is determined by a combination of technology constraints and design
trade-offs. A primary concern in DRAM design is the ratio between the cell capacitance and the
bitline capacitance, as charge sharing between the cell and the bitline limits the ability to sense the
charge stored in a cell. This effectively restricts the length of a bitline and hence the number of
wordlines in a unit array. The number of bitlines, on the other hand, is limited by the parasitic
resistance and capacitance of the wordline, which slows the wordline select time. Further, because
each bitline in a unit array requires a sense amplifier that is activated with each read, power dissipation is also a concern in determining the number of bitlines. Because of these considerations,
typical unit array sizes have not grown appreciably between memory generations. As examples, in
one IBM 16 Mbit DRAM2, the unit array size was 128 Kbits, organized as 512 wordlines and 256
bitlines. In a 1 Gbit DRAM recently announced by Samsung14, the unit array size is 256 Kbits,
organized as 512 wordlines and 512 bitlines.
Because all bitlines are sensed during a read operation, the bandwidth available from a unit array
is tremendous. To calculate this bandwidth, the number of cells per wordline is divided by the
delay from the time that a row address is strobed to the time that the sense amplifier output has
been latched in the row buffers. If the row delay for the 16 Mbit DRAM is conservatively estimated as 30 ns, the effective unit array bandwidth is 256 bits/ 30 ns = 1 GB/s. For the 1 Gbit
DRAM, where the time of an entire read cycle is 31 ns, the available bandwidth per unit array is
over 2 GB/s!
In order to determine the total available internal DRAM bandwidth per chip, the number and
placement of the unit arrays in the overall floorplan must be considered, as well as techniques for
routing data from the unit arrays to the interface circuitry (periphery). Fig. 3(a) shows a floorplan
similar to the IBM 16 Mbit DRAM. The entire chip contains 128 unit arrays, arranged in four
banks of 32 unit arrays, with two banks on either side of the centrally-located interface circuitry.
In this organization, data is routed to the interface circuitry via metal busses that run over the unit
arrays, with one data line multiplexed between 8 bitlines. The net result is that each of the four
banks presents 128 data lines to the interface. If each bank were turned on simultaneously, with a
row read time of 30 ns, the total internal bandwidth would be 2 GB/s.
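
These figures follow directly from the array width and row access time. A minimal sketch of the arithmetic (Python), using the published geometries and the conservative 30 ns estimate quoted above:

# Internal bandwidth of a single unit array: one full row per access.
def unit_array_bw(bits_per_row, row_access_s):
    return bits_per_row / row_access_s / 8        # bytes per second

print(unit_array_bw(256, 30e-9) / 1e9)  # IBM 16 Mbit array: ~1.07 GB/s
print(unit_array_bw(512, 31e-9) / 1e9)  # 1 Gbit array: ~2.06 GB/s

# Floorplan of Fig. 3(a): four banks, each presenting 128 data lines,
# all read together with a 30 ns row time.
print(4 * 128 / 30e-9 / 8 / 1e9)        # ~2.1 GB/s total internal bandwidth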
In general, the higher the external bandwidth of a DRAM, the greater the amount of internal bandwidth that is wasted. In a 1996 ISSCC paper on their 1 Gbit synchronous DRAM, Mitsubishi
reports that only 16 K sense amplifiers are activated during each read7. Again assuming a very
conservative 30 ns access time, this amounts to an internal bandwidth of over 60 GB/s - enough
for our 2003 core CPU without caching. Although the external bandwidth of the chip is 0.2 GB/s
(through 64 data pins), this still represents less than 0.5 percent of the internal bandwidth at the
sense amplifiers.
If one went to the extreme and gave independent access to each unit array, then the peak bandwidth available in a 16 Mbit design (with 128 unit arrays) would be in excess of 128 GB/s, and a
part like the 1 Gbit DRAM, with 4,096 unit arrays, would have 8 TB/s (terabytes/s) - enough
bandwidth to support 200 of our 2003 state-of-the-art microprocessor cores, and this is with
today’s technology!
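
The same simple products give the chip-level totals. The sketch below (Python) reuses the 30 ns and 31 ns row times assumed above; the per-array row widths are those of the 16 Mbit and 1 Gbit examples already cited.

# Hidden bandwidth totals, using the same conservative row times.
internal_bw = 16 * 1024 / 30e-9 / 8     # 16 K sense amplifiers per read
print(internal_bw / 1e9)                # ~68 GB/s, "over 60 GB/s"
print(0.2e9 / internal_bw * 100)        # external 0.2 GB/s is ~0.3% of it

# Extreme case: every unit array accessed independently.
print(128 * 256 / 30e-9 / 8 / 1e9)      # 16 Mbit part: ~137 GB/s
print(4096 * 512 / 31e-9 / 8 / 1e12)    # 1 Gbit part: ~8.5 TB/s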
4.0 Reclaiming the “Hidden Bandwidth”: The PIM Architectural
Spectrum
From the memory bandwidth equation presented earlier, it is clear that there are three ways of improving the supply of data from the memory system to a processor: increasing the number of data pins on the memory, reducing the average access time from the memory, or improving
the utilization of the data bus. The key to PIM is the ability to reclaim the vast “hidden bandwidth” from within the memory through optimization of each of these three parameters. More
than simply embedding processing logic and memory on the same chip, this viewpoint affects the
design at all levels, from the instruction set, to logic, to the physical floorplan.
PIM as a technology clearly offers some stunning improvements in basic memory parameters—so
much so that it permits returning to basic computer design with a “clean sheet of paper” view that
is not distorted by “classical” approaches. Although only perhaps a dozen significant PIM chips
have been designed to date, several distinct architectural directions have emerged. These directions vary in their generality of function, how amenable they are to scaling to multi-chip systems
at enhanced performance levels, and how effective they are in using the bandwidth potential on
chip. They also vary in how sensitive they are to technology, such as at what level of memory or
logic density they first become feasible, what ratios of logic to memory area they require, and how
effective they are at extracting usable performance from the silicon area that they consume.
To date we have identified seven distinct architectural directions that one might take in using PIM
technology. Each is discussed briefly below. Fig. 3 outlines some generic chip floorplans.
Moving The Cache onto the Memory
One clear approach to reducing the bandwidth gap between the memory and the processor is
reducing the average access time from memory. Although there are limits to the time that it takes
to access an individual memory cell (imposed by the ability to sense charge in a cell), the standard
memory organization discussed in the previous section makes it possible to exploit data locality,
FIGURE 3. PIM floorplans (not to scale): (a) Commodity DRAM (b) Mitsubishi M32R/D (c) Linden DAAM (d) EXECUBE (e) Shamrock
so that once a given address has been read, other addresses on the same row are latched into the
sense amplifiers and may be read very quickly. As a result, the average data access time decreases
for the typical addressing patterns that arise from the execution of computer programs. This natural “caching” ability within the memory is exploited in fast page mode or extended data out
(EDO) DRAMs. The idea is carried even further in the RAMBUS architecture, which has multiple internal cache lines to reduce the miss rate, as well as data encoding that improves pin utilization. To take this notion to the extreme, Saulsbury et al. of Sun Microsystems showed that by
embedding a two-way set associative cache with only 32 lines that run the full width of the mem-
ory in a 256 Mbit DRAM, memory latencies for the SPEC ‘95 benchmark suite could be so low
that a 200 MHz MicroSparc-II with such a memory system could have overall performance comparable to a 300 MHz Alpha 21164 (DEC 8200 5/300)10. In terms of silicon efficiency, because
the sense amplifiers are already in place, integrating the cache with the DRAM offers tremendous
savings in transistor count, and hence power, over a conventional, separate cache.
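
The benefit of this in-memory caching can be captured with the usual average-access-time expression. The numbers in the sketch below (Python) are purely illustrative assumptions, not figures from the Sun study:

# Average access time with an in-memory row-buffer "cache".
def avg_access_ns(hit_rate, hit_ns, miss_ns):
    return hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns

full_access_ns = 60.0   # assumed full row + column access
row_hit_ns     = 10.0   # assumed access when the row is already latched

for hit_rate in (0.0, 0.5, 0.9):
    print(hit_rate, avg_access_ns(hit_rate, row_hit_ns, full_access_ns))
# 0.0 -> 60 ns, 0.5 -> 35 ns, 0.9 -> 15 ns: locality captured in the row
# buffer sharply reduces the average latency seen by the processor.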
Moving the storage and logic for the cache function onto the memory chip is a first step towards
exploiting the benefits of Processing-in-Memory. Such an approach, however, is still limited by
the very narrow data channel for moving data off-chip, which in turn is constrained by the number
of pins. Only by moving the entire processor onto the memory chip can the full performance
potential be realized.
Fixed Function Processing
Because of the wide data access to memory, PIM provides natural opportunities for parallelism,
giving rise to a range of architectural possibilities. These architectures may be characterized by
the efficiency with which they can consume these very wide data words, the flexibility of the
design, and the overall complexity. Perhaps the most straightforward application of PIM technology in this manner is in the design of accelerators that perform bit-level arithmetic operations at
the memory row buffers. The most obvious candidate applications involve graphics processing,
with one or more layers of the graphics pipeline integrated onto a memory part that is used for display buffering. Functions might include Z-buffering, bit-blitting, texturing, etc., and could serve
as accelerators in conventional systems. Several such chips are in production now. Other such
PIMs are finding their way into network interface buffers.
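
As a concrete illustration of row-buffer-level processing, consider a depth (Z-buffer) test applied to an entire latched row at once. The behavioral sketch below (Python) is illustrative only; the row width and data layout are assumptions, not the organization of any particular graphics PIM:

# Behavioral model: depth-test an entire row of pixels in one pass,
# as a row-buffer-wide fixed-function unit might.
def z_buffer_row_update(stored_z, stored_color, new_z, new_color):
    # All four lists stand in for data latched in equally wide row buffers.
    for i in range(len(stored_z)):
        if new_z[i] < stored_z[i]:       # incoming fragment is closer
            stored_z[i] = new_z[i]
            stored_color[i] = new_color[i]

# Example with an (illustrative) 8-pixel row.
z = [9, 9, 9, 9, 9, 9, 9, 9]
c = [0] * 8
z_buffer_row_update(z, c, [5, 12, 3, 9, 1, 15, 7, 9], [1, 2, 3, 4, 5, 6, 7, 8])
print(z)  # [5, 9, 3, 9, 1, 9, 7, 9]
print(c)  # [1, 0, 3, 0, 5, 0, 7, 0]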
Because these functions can be implemented with comparatively small amounts of logic and typically require lower clock rates, such chips have the key advantage of stressing the semiconductor
technology the least. Further, the amount of memory needed for an entry-level part is fairly low; for example, for a display with a million pixels, graphics PIMs become feasible with a few megabytes on chip—certainly within range of many different technologies today.
There are, however, two major disadvantages to such architectures. First, a non-programmable
chip has a relatively small range of applicability. Second, within one system such chips are not
highly scalable; for example, in a graphics environment three simple PIMs—one per color—
would just about exhaust the back end of the graphics pipeline. Both of these disadvantages would
limit market volume, driving up the price that must be charged per chip to recover the design
efforts.
Single Chip Computing Systems
Another obvious direction for PIMs is the integration onto a single chip of a complete computing
system: memory, CPU, and major I/O subsystems. This is an extension of the single chip microcontrollers used for years, but with significantly more powerful CPUs and significantly larger
amounts of memory. There are several significant advantages to single-chip system integration. First, a single chip type simplifies system design. Second, wide off-chip memory and cache
busses can be eliminated, reducing pin count and (often significantly) power consumption. Third,
single chip systems also reap the benefits of integrating the cache onto the memory as discussed
earlier. Finally, as technology advances, the migration of the chip to either a smaller die or one
with more memory is direct.
There are, however, potential problems with single-chip systems. The first set of problems arises
from technology limitations. Although total silicon area for a single-chip system is less than for a
multi-chip solution, it is all on one die, having a potentially adverse effect on yield. Further, if the
design targets both the densest available memory and a complex “state-of-the-art” CPU, it may
require the best features of memory and logic fabrication processing—discussed later in this
paper—which would greatly increase chip cost. Additional problems may result from scalability. If one wants to build coherent shared memory systems, then there must be sufficient off-chip contacts to carry the internode bandwidth, as well as on-chip logic to handle the memory coherency issues. Both of these drive up the price of a part without providing additional functionality
when used in the most probable singleton configuration.
Existing PIMs of this type run the gamut of architectural sophistication. In the early 90s, a version
of Cal Tech’s MOSAIC chip integrated a small amount of planar DRAM onto the CPU chip. The
more recent Analog Devices SHARC line of DSPs uses several megabits of fast SRAM for on-chip
memory connected in “classical” fashion to an on-chip CPU. This provides very high levels of
performance with a relatively simple design. The floorplan of the Mitsubishi M32R/D, shown in
Fig. 3(b), has 16 megabits of DRAM, augmented with a small first level cache abutting each of the
two memory arrays, and a simple, pipelined CPU in the center of the chip between the two halves
of cache. The chip also provides the equivalent of a memory bus to allow the on-chip CPU to
access off-chip memory. The Mitsubishi PIM demonstrates many of the performance advantages
predicted by Sun's simulations, namely, that a small cache embedded in the memory can reduce
the memory bottleneck to the point that a simple CPU can achieve near peak performance, comparable to that of a high-speed microprocessor but without the complexity.
An interesting observation about the above examples is that while they do successfully integrate a processor onto a memory chip, they still use conventional processor architectures. None of them uses a large fraction of the potential peak memory bandwidth, and what bandwidth they do use goes to filling cache lines, not to driving computational function units directly. As such, alternative architectures,
such as VLIW, may in fact match PIM potentials better than today’s designs.
SIMD in Memory
One way to use more of the bandwidth of PIMs is to place multiple programmable dataflows very
near the sense amps, and to operate such dataflows in a SIMD parallel fashion. Data is read into
the dataflows from multiple on-chip memory stacks simultaneously, operated on simultaneously,
and then written back simultaneously.
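
Behaviorally, the model is a single operation broadcast to every dataflow, each working on a row from its own bank. The sketch below (Python) is a software caricature of that lockstep operation; the bank count, row width, and operation are illustrative assumptions:

# Behavioral caricature of SIMD-in-memory: each "bank" is a local memory
# with its own dataflow; one broadcast operation is applied in lockstep.
def simd_broadcast(banks, row, op):
    for bank in banks:                   # conceptually simultaneous, one per dataflow
        bank[row] = [op(x) for x in bank[row]]

# Illustrative machine: 4 banks, each with 2 rows of 8 values.
banks = [[[b * 10 + i for i in range(8)] for _ in range(2)] for b in range(4)]
simd_broadcast(banks, 0, lambda x: x + 1)   # increment row 0 in every bank at once
print(banks[0][0])  # [1, 2, 3, 4, 5, 6, 7, 8]
print(banks[3][0])  # [31, 32, 33, 34, 35, 36, 37, 38]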
For appropriate applications, such an approach is both simple and enormously effective. In many
cases, an almost unlimited number of chips can be ganged together for even more performance.
Finally, in terms of the ratio of memory to logic, the entry point for building such devices is often
relatively low. In a real sense, the key chips that enabled such machines as the original MPP, the
DAP, the CM-2, and a variety of associative memory machines were all SIMD PIMs, albeit with a
very small amount of simple flip-flop memory per node. More recently, chips such as the Terasys
and the Linden DAAM have used denser SRAM or DRAM as the base memory technology, with
significantly more complex dataflows positioned very near the sense amps. The DAAM, outlined
in Fig. 3(c), in particular, exhibits huge internal parallelism—it has, on a single 4 megabit DRAM, 4,096 separate 8 bit ALUs, all capable of processing simultaneously.
There are two limitations of such PIMs. First, because they are SIMD, they are still limited in
their programmability. Second, although they are capable of huge peak performances when the
operands for each dataflow are in that dataflow's memory bank, performance can plummet if operands for one dataflow must come from some distant memory bank, especially if that bank is off-chip. Not only must chip contacts be made available to carry such traffic, but a
great deal of complexity may be expended in programming the transfers in an efficient manner.
SMP PIMs
The next category of PIMs attempts both to use more of the on-chip bandwidth and to reduce the
inter-dataflow communication problem by placing multiple complete processors in a shared-memory symmetric multiprocessor (SMP) configuration on-chip. Current designs of this type, such
as the Texas Instruments MVP, partition the on-chip memory into a small number of essentially
conventional looking memory modules, and then implement a cross-bar or equivalent network to
interconnect them to the processors. The MIT MAP project is similar, placing four VLIW-like
CPUs on a chip with 128KB of shared SRAM, although in this case the on-chip memory is treated
as cache for off-chip DRAM.
Memory density is today the major constraint in the construction of such chips. For the most part,
current chips have used low-density SRAM, with the result that for many real applications the on-chip processors are memory-starved, and spend a great deal of time importing and exporting data
that will not all fit on chip.
While this design approach also can result in a very effective single chip design, it still runs into
complexity problems when more than one such chip is to be combined into a single system. The
reason is that the architectural mechanisms needed to tie multiple chips together (either I/O ports
or again some sort of shared memory bus with coherence support) are fundamentally different
from those used between processors on-chip. This places a great burden on the programmer, who
must now know when to invoke which mechanism. Further, such programs are going to be very
fragile as technology advances. Increasing the number of processors on-chip, for example, may
totally destroy a program written for an earlier generation PIM.
Tiled PIMs
Another approach to parallel PIM designs takes a more holistic view. The silicon area is recognized up front as a two-dimensional surface onto which logic areas and memory areas can be
“tiled.” Several architectural goals govern the tiling arrangement. First, processor and memory
components are interleaved so that maximal memory bandwidth can be provided in the richest
possible manner to the maximum number of processing units. Second, the depth of each memory
bank can be adjusted to provide appropriate processing-to-storage ratios for a class of related
applications. Finally, the interconnections between memory and processor components should be
simple, uniform, high bandwidth, and scalable, within a single chip as well as between multiple
chips. As with SMP chips, the major constraint to the construction of such chips is memory density. Matching the inter-node interconnection to what comes naturally on the surface of a chip,
however, can both simplify silicon use and provide simpler programming models.
Perhaps the first step in the direction of tiled PIMs was the EXECUBE chip4,13, shown in Fig.
3(d). In this design, two 32 KB DRAM memory unit arrays were teamed with a complete standalone 16 bit CPU, with eight of these combination nodes tiled over a single die. Each such node
included support for four DMA links. Three DMA links from each node were used to form a low
wiring complexity binary hypercube of computers on chip, with the remaining link attached to
off-chip contacts. Thus large combinations of chips could be constructed by simple interconnection of DMA links between neighboring nodes, with a single interconnection protocol. This also
formed the basis for the programming model.
Because of the origins of the memory macros used in EXECUBE, only limited access from the
CPU’s instruction set to all of the sense amps was possible. Even so, under certain modes, several
gigabytes per second of on-chip memory bandwidth could be profitably consumed. Our current
work is carrying this design direction even further with the Shamrock PIM architecture6. In the
Shamrock floorplan, shown in Fig. 3(e), bands of processors and memory banks alternate in
stripes across the chip, with the sense amplifiers of the memory facing the processors. Identical
processing nodes will employ a combination of SIMD dataflows and a VLIW-like CPU.
A key innovation in Shamrock is to provide access to the memory from either end of the banks.
This allows processors on either side the best possible interconnection mechanism—a true shared
memory link. By offsetting the positions of the processing nodes on different rows, a topology
similar to a 2-d mesh is formed, where each node has true shared memory access to four other
nodes. Appropriate floorplanning then allows interface logic on the periphery of the chip to connect to memory banks on other chips, again with a single uniform programming model.
Reconfigurable PIM
As a low-cost, quick-turnaround approach to PIM development, we have proposed the idea of a
reconfigurable PIM gate-array. Sized at approximately 32 Mbits of high-density DRAM and 250
K logic gates, such a chip could accommodate a wide variety of applications, such as accelerators
for image and signal processing. The floorplan for a reconfigurable PIM chip would contain multiple memory banks and fixed-point dataflows pitch-matched to the memory arrays. The chip
would also contain sea-of-gates or gate-array regions for implementing other logic, control,
and interface functions.
5.0 Challenges in Implementation
For the past several decades, the conventional approach to computer system development has
been to separately optimize the memory and processor components. This has led not only to a performance gap but also to a technology gap: processor and memory chips today are manufactured
on fundamentally different fabrication processes and designed using fundamentally different
methodologies.
Logic and DRAM fabrication processes have significant differences in terms of both transistor
and interconnect technology. Logic processes are designed to support both the fastest possible
switching speeds and compact routing between modules. This has led to complex processes
with many expensive steps. To reduce the layout area, four or more layers of metal interconnect
are common, which requires wafer surface planarization before the deposition of each metal layer.
Transistor device characteristics are also carefully optimized to minimize delay: for example, the
polysilicon gates of PMOS and NMOS transistors typically are doped separately to compensate
for work function differences.
In the evolution of DRAM fabrication processes, the primary goals have been minimizing the area
and maximizing charge retention of the memory cells, while keeping the overall manufacturing
cost as low as possible. As a result, efforts have focused chiefly on optimizing the design of the
memory cell, with minimal emphasis on the transistors and interconnect in the periphery. Today’s
DRAM cells have highly complex geometries that use either trench capacitors that extend deep
into the bulk of the silicon, or stacked capacitors that have fin-like projections layered above the
silicon surface. Both of these structures have far higher capacitance per unit area than the planar
capacitors that could be implemented with pure logic fabrication processes. Peripheral logic transistors in DRAMs, however, typically do not have the optimizations for speed that would be found
in custom logic processes, as this would increase cost without significantly improving overall performance. As a result, DRAM periphery transistors are approximately half as fast as optimized
logic transistors. Further, DRAM fabrication processes typically have only one or two layers of
metal interconnect, as compared to four or more for logic processes.
An integrated PIM technology requires the densest possible memory. It would also require reasonably fast transistors and sufficient layers of interconnect, yet as performance is gained through
architecture, these may not have to be optimized to the degree commonly found in high-performance logic processes. In time, such process enhancements may be made, but in general a PIM
fabrication process would be based upon a commodity DRAM process. In order to assess the
manufacturability and economic feasibility of a PIM product, new yield models would have to be
developed that account for defects in either the logic or the memory array, and that consider
redundancy or error correction.
As a result of the differences in function and technology, logic and DRAM design methodology
has also evolved along very different lines. In logic design, some of the main challenges are at the
highest levels of abstraction: determining a component structure that describes very complex
behavior, and then verifying that this structure is correct. This has led to a library-based approach
to design, using high-level synthesis and simulation CAD tools, as well as highly-automated component placement and routing. As much as is possible, detailed design and analysis at the transistor level is abstracted away from the logic design process.
Because memories are largely analog designs, concerns at the transistor level are paramount.
Electrical characteristics of the cells are determined through detailed 2-d device simulations.
Extensive SPICE simulation is required to adjust the DC and timing performance of the sense
amplifiers and other peripheral circuitry. Circuit layouts are largely hand-crafted in order to minimize area, as well as to match the strict spacing requirements imposed by the cell array.
The challenge in developing a PIM design methodology goes beyond merely reconciling the differing practices and CAD tool suites currently used for logic and memory design. In order to be
effective, the methodology must readily grant logic designers access to the full internal bandwidth
of the memory, yet shield them from the detailed analog behavior. In order to use the highest
memory densities, the PIM design methodology must be synchronized with commodity DRAM
technology cycles, with 18 months of development and 18 months of sustained processing. Further,
because of the highly specialized nature of DRAM processing, a PIM design methodology must
be “open,” so that design houses can interact readily with the foundry.
The key to an open design methodology for PIMs is to base it upon the use of memory macros
that provide a “black box” view of the memory for the logic designer. Given such memory macros, a library-based, ASIC-like design flow for PIMs would be possible. Such a memory macro
would be designed to encapsulate a unit array from a high-density commodity DRAM, and would
serve two purposes. First, it would hide proprietary and unnecessary details from the logic
designer, such as cell design, internal timing, and redundancy. Second, it would provide architectural enhancements such as an extremely high bandwidth digital interface that would latch data
just after the sense amplifiers. To work with the memory macro, logic libraries would be developed that are pitch-matched to the spacing constraints imposed by the cell array. In order to allow
new PIM development to keep pace with the rapidly evolving technology cycles, new CAD tools
would have to be developed, including DRAM macro generators and pitch-matching datapath
synthesizers.
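
To make the black-box idea concrete, the sketch below (Python) models the sort of interface such a macro might export: row-wide reads and writes at the sense-amplifier latch, with cell design, internal timing, and redundancy hidden. The class, its methods, and the row width are hypothetical, not an actual macro specification:

# Hypothetical behavioral model of a wide-row DRAM memory macro.
class MemoryMacro:
    def __init__(self, rows=512, row_bits=2048):
        # Cell organization, internal timing, and redundancy stay hidden;
        # only row-wide transfers at the sense-amplifier latch are exposed.
        self.rows = rows
        self.row_bits = row_bits
        self._array = [[0] * row_bits for _ in range(rows)]

    def read_row(self, row):
        # Return the entire latched row (row_bits wide) in one operation.
        return list(self._array[row])

    def write_row(self, row, bits):
        assert len(bits) == self.row_bits
        self._array[row] = list(bits)

# Pitch-matched logic would attach directly to this row-wide interface.
macro = MemoryMacro()
row = macro.read_row(7)
macro.write_row(7, [b ^ 1 for b in row])  # a row-wide logic operation
print(sum(macro.read_row(7)))             # 2048: every bit touched at once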
Power management is clearly another issue in PIM design. In traditional commodity memory
parts, where the goal is to produce a small, select amount of data at the pins, efforts are made to
activate as few sense amplifiers as possible during accesses in order to avoid large power spikes.
To produce high internal memory bandwidth for PIM applications, however, as many sense amplifiers as is feasible should be turned on in a single cycle.
A final area of concern for PIM designs is testing. In general, test methodologies for logic and for
memory are very different. Logic testing entails running long sequences of input vectors that collectively try to sensitize any defects in the logic paths. In memory testing, multiple complex
address and data patterns are run to uncover any defects that could affect actual code execution. Because, in a sense, logic testing requires hardware that emulates memory and memory testing requires hardware that emulates a processor, the potential exists for highly efficient built-in self-testing for PIMs. Such an approach was explored with EXECUBE, where a test program was initially loaded into memory to verify the basic functionality of a processor, which could then perform more exhaustive testing of the memory by local program execution.
6.0 Conclusions and Future Projections
Clearly, the current trend of improving processor performance through exponential increases in
architectural complexity cannot be sustained indefinitely. By reducing memory latency, providing
wide data access for greatly increased memory bandwidth, and opening new opportunities for
architectures with natural parallelism, PIM avoids the processor/memory bottleneck that plagues
performance in today’s systems. This, and the fact that PIM offers unmatched levels of silicon
efficiency and scalability, leads us to conclude that PIM is inevitably the next major evolution in
computer architecture.
What might commodity PIMs look like? Because the processing logic is placed in the memory,
with PIM there may no longer be a need for external, passive memory in the conventional sense.
Instead, systems may be scaled by adding more PIM modules, onto the same chip or onto different chips. Ultimately, one may envision large amounts of PIM replacing conventional memory in
the form of a "PIM-SIMM," where applications can either run on a conventional "host" microprocessor or in the memory next to the data. This vision is the one we believe offers the most direct
and pervasive application of PIM, and one that at the same time utilizes all its capabilities in a reasonable programming model.
7.0 Bibliography
1. M. Bekerman and A. Mendelson, "A Performance Analysis of Pentium Processor Systems,"
IEEE Micro, Oct. 1995, pp. 72-83.
2. W. Ellis, et al., “Multipurpose DRAM architecture for optimal power, performance, and
product flexibility,” IBM J. of Research and Development, Jan./March 1995, pp. 51-62.
3. B. Joy, "Microprocessor Architecture: The Next 10 Years," keynote address, HOT Chips VIII,
Stanford, CA, August 18-20, 1996.
4. P. Kogge, “EXECUBE - A new architecture for scalable MPPs,” Proc. Int. Conf. on Parallel
Processing, St. Charles, IL, Aug. 1994, pp. 77-84.
5. P. Kogge, et al., “Combined DRAM and logic chip for massively parallel systems,” Proc. 16th
Conf. on Adv. Research in VLSI, Raleigh, NC, Mar. 1995, pp. 4-13.
6. P. Kogge, S. Bass, J. Brockman, D. Chen, E. Sha, “Pursuing a Petaflop: Point Designs for 100
TF Computers Using PIM Technologies,” Proc. Frontiers of Massively Parallel Computation,
Annapolis, MD, Oct., 1996, pp. 88-97.
7. Y. Nitta, et al., "A 1.6GB/s Data-Rate 1 Gb Synchronous DRAM with Hierarchical Square-Shaped Memory Block and Distributed Bank Architecture," IEEE Solid State Circuits Conf.,
San Francisco, CA, Feb. 1996, pp. 376-377.
8. D. Patterson, T. Anderson, and K. Yelick, "A Case for Intelligent DRAM: IRAM," HOT Chips
VIII, Stanford, CA, August 18-20, 1996, pp. 75-94.
9. B. Prince, Semiconductor Memories, second edition, Wiley, 1991.
10. A. Saulsbury, F. Pong, and A. Nowatzyk, “Missing the Memory Wall: The Case for Processor/
Memory Integration,” Proc. of 23rd ISCA, June, 1996.
11. Semiconductor Industry Association, The National Technology Roadmap for Semiconductors,
SIA, 4300 Stevens Creek Blvd, Suite 271, San Jose, CA 95129, 1994.
12. R. Sites, “It's the Memory, Stupid,” Microprocessor Report, Vol. 10, No. 10, August 5, 1996,
pp. 18-24.
13. T. Sunaga, et al., “A Parallel Processing Chip with Embedded DRAM Macros,” IEEE J. of
Solid-State Circuits, Vol. 31, No. 10, Oct. 1996, pp. 1556-1559.
14. J. Yoo, et al., "A 32-Bank 1 Gb DRAM with 1 GB/s Bandwidth," IEEE Solid State Circuits
Conf., San Francisco, CA, Feb. 1996, pp. 378-379.