Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory
Architectures
Steven J.E. Wilton
Department of Electrical and Computer Engineering
University of British Columbia
Vancouver, BC, Canada, V6T 1Z4
stevew@ece.ubc.ca

Abstract
It has become clear that large embedded configurable
memory arrays will be essential in future FPGAs. Embedded arrays provide high-density, high-speed implementations of the storage parts of circuits. Unfortunately, they
require the FPGA vendor to partition the device into memory and logic resources at manufacture-time. This leads to
a waste of chip area for customers that do not use all of the
storage provided. This chip area need not be wasted, and
can in fact be used very efficiently, if the arrays are configured as large multi-output ROMs, and used to implement
logic.
In this paper, we investigate how the architecture of the
FPGA embedded arrays affects their ability to implement
logic. Specifically, we focus on architectures which contain
more than one size of memory array. We show that these
heterogeneous architectures result in significantly denser
implementations of logic than architectures with only one
size of memory array. We also show that the best heterogeneous architecture contains both 2048 bit arrays and 128
bit arrays.
1 Introduction
On-chip storage has become an essential component of
high-density FPGAs. The large systems that will be implemented on these FPGAs often require storage; implementing this storage on-chip results in faster clock frequencies and lower system costs. Two implementations of on-chip memory in FPGAs have emerged: fine-grained and
coarse-grained. In FPGAs employing fine-grained on-chip
storage, such as the Xilinx 4000 FPGAs, each lookup table can be configured as a small RAM, and these RAMs can be combined to implement larger user memories [1].

(This work was supported by the Natural Sciences and Engineering Research Council of Canada, and UBC's Centre for Integrated Computer Systems Research.)
FPGAs employing the coarse-grained approach, on the
other hand, contain large embedded arrays which are used
to implement the storage parts of circuits. Examples of
such devices are the Altera 10K, Apex, and Stratix devices [2, 3, 4], the Xilinx Virtex and Virtex II FPGAs [5],
the Actel 3200DX and SPGA parts [6, 7], and the Lattice
ispLSI FPGAs [8].
The coarse-grained approach results in significantly
denser memory implementations, since the per-bit overhead is much smaller [9]. Unfortunately, it also requires
the FPGA vendor to partition the chip into memory and
logic regions when the FPGA is designed. Since circuits
have widely-varying memory requirements, this "average-case" partitioning may result in poor device utilizations for
logic-intensive or memory-intensive circuits. In particular,
if a circuit does not use all the available memory arrays
to implement storage, the chip area devoted to the unused
arrays is wasted.
This chip area need not be wasted, however, if the unused memory arrays are used to implement logic. Configuring the arrays as ROMs results in large multi-output
lookup-tables that can very efficiently implement some
logic circuits. In [10], a new tool, SMAP, was presented
that packs as much circuit information as possible into the
available memory arrays, and maps the rest of the circuit
into four-input lookup-tables. It was shown that this technique results in extremely dense logic implementations for
many circuits; not only is the chip area of the unused arrays
not wasted, but it is used more efficiently than if the arrays
were replaced by logic blocks. Thus, even customers that
do not require storage can benefit from embedded memory
arrays.
The effectiveness of this mapping technique, however,
is very dependent on the architecture of the embedded
memory arrays. If the arrays are too small, the amount
of logic that can be packed into each will be small, while
if the arrays are too large, much of each array will be
unused. Previous studies have focused on the architecture of these memory resources when implementing storage [11, 12, 13]. Since they are so effective at implementing logic, however, it is important that the design of the
embedded memory arrays also consider this.
In [14], the effects of array depth, width, and flexibility of memory arrays when they are used to implement logic were explored. That paper, however, only considered homogeneous memory architectures, i.e., architectures in
which each memory array is identical. In this paper, we
show that significant density improvements are possible if
the FPGA contains a heterogeneous memory architecture,
that is, an architecture with more than one size of memory
array.
The goals of this paper are as follows:
1. The first goal is to quantify the density improvements
that are possible with a heterogeneous memory architecture (compared to a homogeneous memory architecture) when used to implement logic.
2. There are many possible heterogeneous memory architectures (different array sizes, numbers, etc.). The
second goal of this paper is to find the heterogeneous
memory architecture that can most efficiently implement logic.
The architectural space explored in this paper is described in Section 2. Section 3 describes the experimental
methodology and reviews the SMAP algorithm. Finally,
Section 4 presents experimental results.
2 Embedded Array Architectures
Table 1 summarizes the parameters that define the
FPGA embedded memory array architecture, along with
values of these parameters for several commercial devices.
In this paper we consider architectures with two different array sizes; we denote the number of bits in each type of array as B1 and B2, and the number of each type of array as N1 and N2. We assume that all arrays have the same set of allowable data widths, and denote that set by W_eff. For a fixed size, a wider memory implies fewer memory words in each array. In the Altera FLEX 10K, for example, B = 2048 bits and W_eff = {1, 2, 4, 8}, meaning each array can be configured as one of 2048x1, 1024x2, 512x4, or 256x8.
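As a concrete illustration of the depth/width tradeoff just described, the sketch below enumerates the legal configurations of a configurable array. The helper name is an assumption for illustration, not part of any vendor tool.

```python
# Enumerate the depth x width configurations implied by a fixed array size
# and a set of allowable data widths. Hypothetical helper, for illustration.

def array_configurations(bits, allowed_widths):
    """Return (depth, width) pairs: a wider memory implies fewer words."""
    return [(bits // w, w) for w in allowed_widths if bits % w == 0]

# The FLEX 10K example: B = 2048 bits, W_eff = {1, 2, 4, 8}
print(array_configurations(2048, [1, 2, 4, 8]))
# -> [(2048, 1), (1024, 2), (512, 4), (256, 8)]
```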
3 Methodology
[Figure 1: Example Mapping to an 8-Input, 3-Output Memory Block: (a) original circuit; (b) final implementation.]

To compare memory array architectures, we employed an experimental methodology in which we varied the architectural parameters and mapped a set of 28 benchmark circuits to each architecture. Each circuit contained between 527 and 6598 4-LUTs. Fifteen of the circuits were sequential. The combinational circuits and 9
of the sequential circuits were obtained from the Microelectronics Corporation of North Carolina (MCNC) benchmark suite, while the remaining sequential circuits were
obtained from the University of Toronto and were the result of synthesis from VHDL and Verilog. All circuits were
optimized using SIS [15] and mapped to four-input lookuptables using Flowmap and Flowpack [16]. The SMAP algorithm was then used to pack as much circuit information
as possible into the available memory arrays. The number
of nodes that can be packed to the available arrays is used
as a metric to compare memory array architectures.
The results in this paper depend heavily on the SMAP
algorithm, which was originally developed for architectures in which all arrays are the same size. The following subsection reviews SMAP, while the subsequent subsection shows how SMAP can be used to map logic to a
heterogeneous memory architecture.
3.1 Review of SMAP
This section briefly reviews SMAP; for more details,
see [10].
The SMAP algorithm is based on Flowpack, a post-processing step of Flowmap [16]. Given a seed node, the algorithm finds the maximum-volume k-feasible cut, where k is the number of address inputs to each memory array. A k-feasible cut is a set of no more than k nodes in the fanin network of the seed such that the seed can be expressed entirely as a function of those nodes; the maximum-volume k-feasible cut is the cut which contains the most nodes between the cut and the seed. The nodes that make up the cut become the memory array inputs. Figure 1(a) shows an example circuit along with the maximum 8-feasible cut for seed node A.
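The cut definition above can be sketched as a small check: a set of nodes is a k-feasible cut for a seed if it has at most k members and the seed depends only on nodes in the set. The circuit representation and function names below are assumptions for illustration, not SMAP's actual code.

```python
# Illustrative check of the k-feasible cut definition. The circuit is a
# dict mapping each logic node to its list of fanin nodes; primary inputs
# have no entry. Hypothetical helper, not from SMAP.

def is_k_feasible_cut(circuit, seed, cut, k):
    """True if `cut` has at most k nodes and the seed can be expressed
    entirely as a function of the cut nodes."""
    if len(cut) > k:
        return False
    cut = set(cut)

    def covered(node):
        if node in cut:
            return True
        fanins = circuit.get(node, [])
        if not fanins:              # a primary input not covered by the cut
            return False
        return all(covered(f) for f in fanins)

    return covered(seed)

# Tiny example: A = f(B, C), B = f(x, y), C = f(y, z)
circuit = {"A": ["B", "C"], "B": ["x", "y"], "C": ["y", "z"]}
print(is_k_feasible_cut(circuit, "A", {"x", "y", "z"}, 8))   # True
print(is_k_feasible_cut(circuit, "A", {"B"}, 8))             # False: C is uncovered
```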
Given a seed node and a cut, SMAP then selects which
nodes will become the memory array outputs. Any node
that can be expressed as a function of the cut nodes is a potential memory array output. The selection of the outputs
Parameter | Meaning                 | Altera 10K | Vantis VF1 | Lattice isp6192 | Range in this paper
----------|-------------------------|------------|------------|-----------------|--------------------
N1        | Number of Type-1 Arrays | 3-16       | 28-48      | 1               | 1-9
N2        | Number of Type-2 Arrays | -          | -          | -               | 1-9
B1        | Bits per Type-1 Array   | 2048       | 128        | 4608            | 128-8192
B2        | Bits per Type-2 Array   | -          | -          | -               | 128-8192
W_eff     | Allowable Data Widths   | 1,2,4,8    | 4          | 9,18            | 1,2,4,8

Table 1: Architectural Parameters (the commercial devices are homogeneous, so only the Type-1 entries apply to them)
is an optimization problem, since different combinations of
outputs will lead to different numbers of nodes that can be
packed into the arrays. In [10], a heuristic was presented;
the outputs with the largest number of nodes in their maximum fanout-free cone (maximum cone rooted at the potential output such that no node in the cone drives a node not
in the cone) are selected. As shown in [10], those nodes
in the maximum fanout-free cones of the outputs can be
packed into the array. All other nodes in the network must
be implemented using logic blocks. In Figure 1(a), nodes
C, A, and F are the selected outputs; Figure 1(b) shows the
resulting circuit implementation.
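The maximum fanout-free cone computation described above can be sketched as a fixpoint: starting from the root, repeatedly add any logic node all of whose fanouts already lie inside the cone. The representation and helper name below are assumptions for illustration, not the SMAP implementation.

```python
# Fixpoint sketch of the maximum fanout-free cone (MFFC) of a root node.
# `circuit` maps each logic node to its fanin list; primary inputs have no
# entry and are excluded from the cone. Hypothetical helper.

def mffc(circuit, root):
    # Build the fanout relation from the fanin lists.
    fanouts = {}
    for node, fanins in circuit.items():
        for f in fanins:
            fanouts.setdefault(f, []).append(node)

    # Collect the transitive fanin of the root (the candidate nodes).
    seen, stack = set(), [root]
    while stack:
        for f in circuit.get(stack.pop(), []):
            if f not in seen:
                seen.add(f)
                stack.append(f)

    # Grow the cone: add a logic node once all of its fanouts are inside it.
    cone = {root}
    changed = True
    while changed:
        changed = False
        for n in seen:
            if n in circuit and n not in cone and all(o in cone for o in fanouts[n]):
                cone.add(n)
                changed = True
    return cone

# F = f(D, E), D = f(a, b), E = f(b, c); node G also uses E, so E cannot
# be absorbed into F's MFFC, but D can.
print(sorted(mffc({"F": ["D", "E"], "D": ["a", "b"],
                   "E": ["b", "c"], "G": ["E"]}, "F")))  # -> ['D', 'F']
```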
Since the selection of the seed node is so important, we
repeat the algorithm for each seed node, and choose the
best results.
If there is more than one array available, we map to the
first array as described above. Then, we remove the nodes
implemented by that array, and repeat the entire algorithm
for the second array. This is repeated for each available
array.
3.2 Extension to Heterogeneous Memory Architectures
The SMAP algorithm was developed assuming a homogeneous memory architecture; that is, one in which each
memory array is identical. Since the arrays are packed one
at a time, the above algorithm can be applied directly to
architectures with different sized memory arrays. The only
issue is whether the large or small arrays should be filled
first. Experimentally, we have determined that the best
results are obtained if we fill all of the large arrays first.
The SMAP algorithm is greedy, in that, for each array, the
largest portion of logic that can be mapped to the array is
selected. Thus, the largest gains are likely to be obtained
from the first few arrays that are filled; therefore it makes
sense that these first few arrays are the large ones.
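The fill order just described amounts to sorting the available arrays by size before running the greedy per-array pass. A minimal sketch, with a stand-in `map_to_array` callback in place of one SMAP pass (both names are assumptions for illustration):

```python
# Sketch of the large-arrays-first fill order. `map_to_array` stands in for
# one greedy SMAP pass on a single array; here it just records the size it
# was given.

def greedy_fill(array_sizes_bits, map_to_array):
    results = []
    for bits in sorted(array_sizes_bits, reverse=True):  # largest arrays first
        results.append(map_to_array(bits))
    return results

order = greedy_fill([128, 2048, 128, 2048], lambda bits: bits)
print(order)  # -> [2048, 2048, 128, 128]
```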
4 Results
4.1 Homogeneous Architecture Results
We first consider architectures in which all arrays are
of the same size (this is the homogeneous case considered
in [14]). Figure 2 shows how the effectiveness of each
memory array in implementing logic depends on the array
size, assuming 8 arrays are available. Figure 2(a) shows the
number of logic blocks that can be packed into the arrays
(averaged over our 28 benchmark circuits) vs. array size.
Figure 2(b) shows the estimated chip area of the 8 memory
arrays, also as a function of array size. The area estimates
were obtained from a detailed area model [17] and are expressed in logic block equivalents (LBE). One LBE is the
area required to implement one logic block.
Figure 2(c) shows the packing density as a function of
array size. The packing density is defined as the ratio of
the number of logic blocks that can be packed into the
available memory arrays over the area required to implement the memory arrays (in LBEs). A packing density of
1 means that the density of logic implemented in memory
arrays is equal to that if the logic was implemented in logic
blocks. A packing density greater than 1 means that the
density of logic implemented in memory arrays is greater
than that if logic blocks were used. As Figure 2(c) shows,
the packing density is greater than 1 for all but the largest
memory array. The highest packing density occurs when
the arrays each contain 512 bits. See [14] for a more thorough treatment of homogeneous architectures.
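The packing density metric defined above is a simple ratio. A one-line sketch, with illustrative numbers rather than measured results:

```python
# Packing density as defined above: logic blocks packed into the arrays
# divided by the arrays' area in logic block equivalents (LBEs). A value
# above 1 means the arrays implement logic more densely than logic blocks
# would. The numbers below are illustrative, not measurements.

def packing_density(packed_logic_blocks, array_area_lbes):
    return packed_logic_blocks / array_area_lbes

print(packing_density(300, 150))  # -> 2.0
```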
4.2 Heterogeneous Architecture Results
In this section, we consider architectures which contain two different sizes of memory arrays. Using the terminology of Section 2, each FPGA will have N1 arrays of B1 bits each and N2 arrays of B2 bits each. We restrict our attention to architectures with three different ratios of N1:N2: 1:1, 1:2, and 1:3.
[Figure 2: Homogeneous Architecture Results, 8 arrays. Three panels plotted against bits per array (128 to 8192): (a) logic blocks packed, (b) area in equivalent logic blocks, (c) packing ratio.]

Figure 3 shows the packing density for several sizes of B1 and B2, assuming the ratio N1:N2 = 1:1 (that is, there are four of each kind of array).

[Figure 3: Heterogeneous Architectures, 4 arrays of each type. (a) Numerical results: packing density for each (B1, B2) pair, with values ranging from 0.99 to 3.42. (b) Graphical results: the same data as a surface over B1 and B2, each from 128 to 8192 bits.]

As the results show, the
best packing density occurs when there are four arrays of
2048 bits each, and four arrays of 128 bits each (we did not
consider array sizes smaller than 128 bits, since such small
arrays would not be suitable for implementing the memory
parts of circuits, and thus, would not likely be considered
by an FPGA manufacturer). The packing density at this
point is 23% higher than the best packing density obtained
for homogeneous architectures.
We repeated the experiments for several values of N1 and N2; selected graphical results are shown in Figure 4.
In Figure 4(a), one of each type of array is assumed. In this
case, the best architecture is a homogeneous architecture
in which both arrays contain 2048 bits. This was the only
configuration for which a homogeneous architecture was
found to be the best.
Results for FPGAs with the ratio N1:N2 = 1:2 (that is, FPGAs for which there are twice as many type-2 arrays as type-1 arrays) are shown in Figure 4(c) and (d). Results for FPGAs with the ratio N1:N2 = 1:3 (three times as many type-2 arrays as type-1 arrays) are shown in Figure 4(e) and (f). In both cases, the best architecture was found to consist of 2048 bit arrays and 128 bit arrays (this was the case for all architectures which we investigated, except the N1 = N2 = 1 case described above).
It is interesting to note that although an FPGA with both
128 bit arrays and 2048 bit arrays was found to be best,
in some cases, (Figures 4(c) and (e)) the majority of the
arrays should contain 2048 bits, while in other cases, the
majority of the arrays should contain 128 bits (Figures 4(d)
and (f)). This can be observed in the graphs by noticing that
in Figures 4(c) and (e), the highest point is to the “left” of
the center of the graph, while in Figure 4(d) and (f), the
highest point is to the “right” of the center of the graph.
We have investigated other architectures with additional N1:N2 ratios, and have confirmed that, as the
total number of arrays increases, the preference for smaller
arrays increases. Intuitively, if there are more arrays, the
SMAP tool is less able to effectively fill the larger arrays
with logic.
A second conclusion that can be drawn from the results
in Figure 4 (and confirmed by other experiments we have
performed) is that as the total number of arrays increases,
the advantage due to heterogeneous architectures (compared to homogeneous architectures) tends to increase. If
there are only two arrays, a homogeneous architecture is
better, while if there are 12 arrays (Figures 4(d) and (f)),
the heterogeneous architecture is considerably better (22%
better in each case).
5 Conclusions
Although embedded arrays in FPGAs were developed
in order to implement on-chip storage, it is clear that these
arrays can also be configured as ROMs and used to implement logic. In this paper, we have shown that significant
density improvements are possible if the FPGA contains
a heterogeneous memory architecture, that is, an architecture with more than one size of memory array. The amount
of improvement depends on how many memory arrays are
present; if there are eight arrays, we have shown that the
best heterogeneous architecture can implement logic 23%
more efficiently than the best homogeneous architecture.
In virtually all cases, we have found that the best heterogeneous architecture consists of some 2048 bit arrays,
and some 128 bit arrays. The exact number of each size of
array depends on the total number of arrays available; the
more arrays that are present, the larger the proportion that
should be 128 bits.
We have also shown that the benefits of heterogeneous architectures become more significant as the number of arrays increases. This is a compelling argument for heterogeneous memory architectures: future FPGAs are likely to contain more memory than current devices, and FPGAs with such large memory capacities would benefit significantly from a heterogeneous architecture.
References
[1] Xilinx, Inc., XC4000E and XC4000X Series Field Programmable Gate Arrays, ver. 1.6, May 1999.

[2] Altera Corporation, FLEX 10K Embedded Programmable Logic Family Data Sheet, ver. 4.1, Mar 2001.

[3] Altera Corporation, APEX 20K Programmable Logic Device Family Data Sheet, ver. 2.1, Feb 2002.

[4] Altera Corporation, Stratix Programmable Logic Device Family Data Sheet, 2002.

[5] Xilinx, Inc., Virtex 2.5 V Field Programmable Gate Arrays, ver. 1.6, July 1999.
[6] Actel Corporation, Datasheet: 3200DX Field-Programmable Gate Arrays, 1995.
[7] Actel Corporation, Actel’s Reprogrammable SPGAs, 1996.
[8] Lattice Semiconductor Corporation, Datasheet: ispLSI and
pLSI 6192 High Density Programmable Logic with Dedicated Memory and Register/Counter Modules, July 1996.
[9] T. Ngai, J. Rose, and S. J. E. Wilton, "An SRAM-programmable field-configurable memory," in Proceedings
of the IEEE 1995 Custom Integrated Circuits Conference,
pp. 499–502, May 1995.
[10] S. J. E. Wilton, "SMAP: heterogeneous technology mapping for FPGAs with embedded memory arrays," in ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 171-178, February 1998.
[11] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, "Architecture of centralized field-configurable memory," in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 97-103, 1995.
[12] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, “Memory/logic interconnect flexibility in FPGAs with large embedded memory arrays,” in Proceedings of the IEEE 1996
Custom Integrated Circuits Conference, pp. 144–147, May
1996.
[13] S. J. E. Wilton, J. Rose, and Z. G. Vranesic, "Memory-to-memory connection structures in FPGAs with embedded
memory arrays,” in ACM/SIGDA International Symposium
on Field-Programmable Gate Arrays, pp. 10–16, February
1997.
[14] S. J. E. Wilton, “Implementing logic in FPGA embedded
memory arrays: Architectural implications,” in IEEE Custom Integrated Circuits Conference, May 1998.
[15] E. Sentovich, “SIS: A system for sequential circuit analysis,” Tech. Rep. UCB/ERL M92/41, Electronics Research
Laboratory, University of California, Berkeley, May 1992.
[16] J. Cong and Y. Ding, “FlowMap: an optimal technology
mapping algorithm for delay optimization in lookup-table
based FPGA designs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 13,
pp. 1–12, January 1994.
[17] S. J. E. Wilton, Architectures and Algorithms for Field-Programmable Gate Arrays with Embedded Memory. PhD
thesis, University of Toronto, 1997.
[Figure 4: Other Selected Heterogeneous Architecture Results. Six packing-density surfaces, each plotted over B1 and B2 (128 to 8192 bits). Panel (a) assumes one array of each type; panels (c) and (d) use the ratio N1:N2 = 1:2; panels (e) and (f) use the ratio N1:N2 = 1:3.]