mrFPGA: A Novel FPGA Architecture with Memristor

advertisement
mrFPGA: A Novel FPGA Architecture with Memristor-Based
Reconfiguration
Jason Cong
Bingjun Xiao
Department of Computer Science
University of California, Los Angeles
{cong, xiao}@cs.ucla.edu
dominant part in FPGA) gains some improvement, the gap between
FPGAs and ASICs will be significantly reduced.
Abstract— In this paper, we introduce a novel FPGA architecture
with memristor-based reconfiguration (mrFPGA). The proposed
architecture is based on the existing CMOS-compatible memristor
fabrication process. The programmable interconnects of mrFPGA
use only memristors and metal wires so that the interconnects can
be fabricated over logic blocks, resulting in significant reduction
of overall area and interconnect delay but without using a 3D diestacking process. Using memristors to build up the interconnects can
also provide capacitance shielding from unused routing paths and
reduce interconnect delay further. Moreover we propose an improved
architecture that allows adaptive buffer insertion in interconnects
to achieve more speedup. Compared to the fixed buffer pattern in
conventional FPGAs, the positions of inserted buffers in mrFPGA
are optimized on demand. A complete CAD flow is provided for
mrFPGA, with an advanced P&R tool named mrVPR that was
developed for mrFPGA. The tool can deal with the novel routing
structure of mrFPGA, the memristor shielding effect, and the
algorithm for optimal buffer insertion. We evaluate the area, performance and power consumption of mrFPGA based on the 20 largest
MCNC benchmark circuits. Results show that mrFPGA achieves
5.18x area savings, 2.28x speedup and 1.63x power savings. Further
improvement is expected with combination of 3D technologies and
mrFPGA.
Conventional FPGA suffers a lot from its programmable interconnects partly due to extensive use of SRAM-based programming bits,
multiplexers (or pass transistors), and buffers in the interconnects.
One SRAM cell contains as many as six transistors, but can store
only one-bit data. The low density of SRAM-based storage increases
the area overhead of FPGA programmability and consequently, leads
to longer routing paths and larger interconnect delay. In addition
SRAM is a type of volatile memory – this means that it contributes
to excessive power consumption during stand-by.
Emerging technologies, especially emerging non-volatile memory
(NVM) technologies, lead to opportunities of circuit improvement.
Emerging NVMs typically have a smaller cell size than that of
SRAMs. They also have the desirable property of non-volatility,
which means that they can be turned off during stand-by to save
power. Among them, memristor, also often recognized as resistive
random-access memory (RRAM), is considered to be the most
promising one. Since HP Labs presented the first experimental
realization of memristor in 2008 [8], rapid progress on the fabrication
of high-quality memristor has been achieved in the past few years.
The current leading memristor technology for the time being has
outperformed many other emerging NVMs in some important aspects
and performs similar to them in the other aspects. It has been
demonstrated that memristor is scalable below 30nm [9], can be
fabricated with device size as small as F2 (F is the feature size) with
full compatibility of CMOS [10], and can be programmed within
5ns at 180nm technology node [11]. Due to the advantages of its
device property over SRAM and other emerging NVMs, numerous
research efforts have been made to adopt memristors for systemlevel integration [12–15]. However almost all of these studies use
memristors as a new type of memory.
Keywords-FPGA; memristor; ASIC; reconfiguration;
I. I NTRODUCTION
The performance/power efficiency of an application implemented in
an application specific integrated circuit (ASIC) can be as much as
six orders of magnitude higher than its counterpart coded in CPU
[1]. However the rapid increase of non-recurring engineering cost
and design cycle of an ASIC in nanometer technologies makes it
impractical to implement most applications in ASIC. This makes
the Field Programmable Gate Array (FPGA) increasingly more
popular. FPGA is a desirable type of integrated circuit that can be
reconfigured to realize a large range of arbitrary functions according
to customer demands. The programmability makes FPGA a quick and
reusable hardware implementation platform for specific applications.
Compared to ASIC, FPGA has sacrificed its performance to some
extent for programmability [2] but FPGA can still have orders of
magnitude improvement in performance/power efficiency compared
to CPU [1, 3]. The flexibility and performance of FPGA compared to
ASIC and CPU makes FPGA an important component in the realm
of customizable heterogeneous computing platforms [3].
It is well known that the programmable routing structure in FPGA
is the principal source of the performance inferiority of FPGA
when compared to ASIC [4–7]. It is reported that the programmable
interconnects in FPGA can account for up to 90% of the total area [4],
up to 80% of the total delay [5, 6] and up to 85% of the total power
consumption [7]. The study in [2] measures the gap between FPGA
and ASIC. It shows that the area, delay and power consumption of
FPGA are ∼21 times, ∼4 times and ∼12 times as much as those
of ASIC respectively. We see that if FPGA routing structure (the
c
978-1-4577-0994-4/11/$26.00 2011
IEEE
This paper presents a novel FPGA architecture with memristorbased reconfiguration (mrFPGA). With the novel use of memristors
in mrFPGA interconnects rather than memory cells, the proposed
architecture reduces the gap between FPGAs and ASICs in area,
delay and power.
This paper is organized as follows. Section II introduces the
conventional FPGA architecture and review recent research work
on FPGAs with emerging technologies. Section III presents the
unique structure of memristor and its electrical characteristics and
circuit model. Section IV describes the architecture of mrFPGA
from high-level overview to detailed design. Section V provides a
complete CAD flow and evaluation method for mrFPGA. Section VI
presents detailed evaluation results of mrFPGA using the largest
twenty MCNC benchmarks. Finally we draw some conclusions in
Section VII.
1
II. BACKGROUND AND R ELATED W ORK
A. Conventional FPGA
Fig. 1 shows a conventional FPGA architecture. It is made up of
3D-FPGA
Memory Layer
Switch Layer
CMOS Layer
SB
CB
CB
SB
CB
SB
LB LB LB
CB
LB LB LB
SB
LB LB LB
(a) Monolithic stacking [17].
(b) 3D integration with TSVs [18].
Fig. 1: Conventional FPGA architecture.
an array of tiles, and each tile consists of one logic block (LB), two
connection blocks (CB) and one switch block (SB). Each logic block
contains a cluster of basic logic elements (BLEs), typically lookup tables (LUTs), to provide customizable logic functions. LBs are
connected to the routing channels through CBs and the segmented
routing channels are connected with each other through SBs. Fig. 1
also depicts two typical circuits designs of CBs and SBs based on
multiplexers (MUXs) and buffers described in [16]. The selector pins
of each multiplexer are connected to a group of 6-transistor SRAM
cells to have their connectivity determined. The circuits in Fig. 1
have copies of up to the number of pins per side of LBs in a single
CB, and up to the number of tracks per channel in a single SB.
We see that CBs and SBs make up of the interconnects of FPGAs
with much larger area and higher complexity compared to the direct
interconnects of ASIC.
B. Recent Work on FPGAs using Emerging Technologies
With the recent development of emerging technologies, a number
of novel FPGA architectures based on these technologies have been
proposed in the past few years. A 3D FPGA architecture was proposed in [17]. It partitions the transistors in the routing structure and
logic structure of conventional FPGA into multiple active layers and
stacks the layers via monolithic stacking, a 3D integration technology
the author proposes to use as shown in Fig. 2a. It provides 1.7x
performance gain due to the smaller tile area and shorter interconnect
distance between tiles. However 3D stacking without proper thermal
solution, will result in the excessive increase of heat density and the
corresponding degradation of performance [19]. Also applications
of monolithic stacking are currently limited due to the reason that
high temperature required by the fabrication of transistors in an
upper layer is likely to destroy the metals wires and transistors
which have been fabricated in the layers below [18]. In [14] and
[18] two FPGA architectures are proposed using emerging NVMs
and through-silicon-via (TSV) based 3D integration. Similar to the
monolithic stacking in [17], they also move all the transistors in the
routing structure of FPGA to the top die over the logic structure
die with TSV connections between them as shown in Fig. 2b. In
addition they replace SRAM in FPGA with emerging NVMs, such
as phase-change RAM (PCRAM) [18] or memristor [14], to save the
area usage of storing one programming bit and stand-by power as
shown in Fig. 2c. 1.1x and 2x overall performance gains before and
after 3D integration respectively are reported in these two works.
2
(c) Replace SRAMs with emerging NVMs [14].
(d) Nanowire crossbar [19].
Fig. 2: Recent work on FPGA using emerging technologies.
However one potential problem is that the maximum density of
TSVs currently achievable might fail to satisfy the dense connections
required between the routing structure die and logic structure die.
In [19], a 3D CMOS/Nanomaterial hybrid FPGA was proposed.
Again, it separates the transistors of interconnects from those of logic
blocks and redistributes them into different dies as shown in Fig. 2d.
The difference is that it uses nanowire crossbars and face-to-face
3D integration technology to provide connections between the dies.
It provides a 2.6x performance gain compared to the conventional
FPGA architecture with certain power overhead brought by the large
capacitance of crossbar array. In addition the crossbar structure may
consist of a certain percentage of defective components [19]. To summarize all, these works try to stack interconnects over logic blocks for
the sake of reduction of area and interconnect delay. However they
all suffer from a common problem. They still require transistors in
interconnects. To achieve the stacked architectures, these works have
to reorganize the interconnects and logic blocks into separate dies
and introduce 3D die-stacking (or less mature monolithic stacking).
Otherwise the high temperature during fabrication of the transistors
in an upper layer would ruin the existing transistors and metal wires
below in the same die. Die-stacking introduces undesirable problems
in terms of manufacturability, reliability and limit of the vertical
integration density. Some researchers replace the pass transistors
in the routing structure of FPGA with emerging memory elements
integrated in metal layers with back-end-of-line (BEOL) compatible
fabrication [20, 21]. But the improvement is limited due to the small
portion of pass transistors in the routing structure. There are also
studies on new reconfigurable circuit structures, such as CMOL [22]
and CNTFET-based fine-grain cell matrices [23], with the aim of
taking the place of FPGA. Lots of efforts are still needed to realize
these revolutionary ideas in products.
III. M EMRISTOR T ECHNOLOGY
In this work, we propose to use memristors and metal wires but no
transistors to build up the routing structure in mrFPGA. We first intro-
2011 IEEE/ACM International Symposium on Nanoscale Architectures
duce the memristor technology in this section. CMOS-compatibility
is key property to integrate memristor into the mrFPGA architecture
and several works have demonstrated CMOS-compatible memristor
fabrication processes [10, 24, 25]. For example, Fig. 3 is a schematic
view and cross-sectional transmission electron microscopy (TEM)
image of memristor fabricated by a CMOS-compatible process [24]
for example. As shown in Fig. 3a memristor is a two-terminal
Fig. 4: mrFPGA architecture: The programmable routing structure
is built up with metal wires and memristor which are fabricated on
top of logic blocks in the same die according to existing memristor
fabrication structures.
(a) Schematic view.
(b) TEM image.
Fig. 3: Fabrication structure of memristor integrated with CMOS in
the same die [24].
resistance-like device which can be fabricated between the top metal
layer and the other metal layers on top of the transistor layer in the
same die. As explained in [24, 25], since no high temperature takes
place during their memristor fabrication processes, the processes can
be embedded after back-end-of-line process and will not ruin the
transistors and metal wires which have been fabricated below the
memristors, as shown in Fig. 3b. We also see from Fig. 3b that
the structure of memristor is quite simple, resulting in the easy
achievement of small cell size. In spite of the simple structure of
memristor, it achieves the storage of one-bit data successfully. Memristor can be programmed by applying specific programming voltage
or current at its two terminals to switch its resistance value at normal
operation between low state and high state. The resistance values at
the two states are usually more than two orders of magnitude apart
[24], or even up to six orders of magnitude apart [25]. In addition
memristor can keep the resistance value set before being powered off
without supply voltage. With this programmable property, memristor
is used as a promising type of NVM with two different states: low
resistance state (LRS) and high resistance state (HRS). In our work
the important thing is that memristor acts like a programmable switch
which can be reconfigured to determine whether its two terminals are
connected or not. If the two terminals need to be connected to each
other, we just program the memristor to LRS. Otherwise, we program
the memristor to HRS. This special mechanism is quite different from
those of conventional memories and provides a unique opportunity
to build up a new FPGA routing structure with the novel use of
memristors.
IV. mrFPGA S TRUCTURE
A. Basic Structure
As discussed in the previous section, there is opportunity to build
up an FPGA routing structure based on memristors only in addition
to metal wires. Also memristor-based routing structure of FPGA
will be placed over the transistor layer in the same die, just like
the routing structure of ASICs. The basic schematic view of this
efficient FPGA architecture with memristor-based reconfiguration
(mrFPGA) is shown in Fig. 4. With the organization of logic blocks
and programmable interconnects in Fig. 4, the overall mrFPGA
architecture will become a highly compact array as shown in Fig. 5a.
From the figure we see that the area of mrFPGA will be reduced
to the total area of the logic blocks only, which takes only 10%
to 20% of the conventional FPGA area [4]. Fig. 5b is a detailed
design of the connection blocks and switch blocks in mrFPGA using
memristors and metal wires only. In this design we use five metal
layers M5 to M9 without resource conflict. The circuits of logic
blocks in mrFPGA will use the metal layers below them, i.e., M1
to M4, which are sufficient for practical FPGAs, e.g. Virtex-6 [26].
The memristor layer is designed to locate between the M9 and M8
layers. We see that the positions of the layers are in the same order as
the memristor fabrication structure reported in [24] so as to guarantee
the feasibility of mrFPGA fabrication. Though memristors are located
close to the top layer in this design, electrical signals do not always
have to go through all metal layers to reach memristors, since switch
blocks are in M5 ∼ M9. The signals need to reach M1 or M2 from
the memristor layer only when they access the pins of logic blocks.
Moreover, as stated in Section I, the size of a memristor cell can
be close to or even smaller than the width of a metal wire [9, 10].
So memristors can easily fit into the design in Fig. 5b without extra
area overhead. The memristor array can be programmed efficiently
using V /2 biasing scheme and two-step write operation as proposed
in [15].
B. Interconnects with Adaptive Buffer Insertion
One potential problem of the basic mrFPGA routing structure is
that there are no buffers in the structure. It leads to a quadratic
increase of interconnect delay with the increase of the interconnect
distance. This problem would undermine the benefit brought by the
significant reduction of the overall area and interconnect delay in
the case of large FPGAs. To address this problem, we propose an
improved FPGA architecture which allows adaptive buffer insertion
in the interconnects as shown in Fig. 6. A limited number of buffers
are prefabricated in routing channels and can be connected to the
tracks in channels via memristors. Whether to insert the prefabricated
buffers or not and which track uses a buffer depend on the placement
result of the circuit to implement on mrFPGA. For a routing path
long enough to exhibit the quadratic increase of RC delay, we will
let some buffers be connected to the path at proper positions. To get
the optimal solution of insertion choice for each buffer in the routing
network, we borrow an idea from the buffer placement algorithm in
ASIC [27]. This algorithm can decide the best positions of buffers in
the interconnects of an ASIC using dynamic programming to achieve
the minimum interconnect delay. It constructs delay/capacitance pairs
which correspond to different buffer options in a routing tree of an
ASIC via a depth-first search and produces the optimal option in
time complexity of O(B 2 ) where B is the number of legal buffer
positions. For mrFPGA applications, we let the buffer at certain
position be inserted to the routing track if the algorithm decides to
2011 IEEE/ACM International Symposium on Nanoscale Architectures
3
(a)
(b)
Fig. 5: The proposed mrFPGA architecture. (a) Overview of mrFPGA where the FPGA area is contributed by logic blocks only, and (b) A
detailed design of the connection blocks and switch blocks in mrFPGA using the memristors and metal wire resources in existing memristor
fabrication structures.
Fig. 6: An improved mrFPGA architecture with adaptive buffer
insertion in the programmable interconnects.
place a buffer at this position. Then whether or not to insert one
buffer at one routing node in mrFPGA is optimized on demand. In
contrast in conventional FPGA the connections between buffers and
routing tracks are predetermined during fabrication. A case study in
Fig. 7 shows the benefit brought by the adaptive buffer insertion in
mrFPGA when compared to the fixed buffer pattern in conventional
FPGA. To simplify the problem, we assume in this case that the RC
delay of a wire with the length of one block is 0.5Rwire Cwire = 1ns,
and that the buffers in interconnects are considered as ideal buffers
with infinite drive, no input/output capacitance, and a fixed intrinsic
delay of 9ns. The optimal length of a wire between two adjacent
buffers would be
r
Tbuffer
=3
k=
0.5Rwire Cwire
In conventional FPGA, the pattern of pre-fabricated buffers in interconnects will follow this result to distribute buffers evenly with a
distance of three blocks (as shown in Fig. 7). The problem is that
it does not know the starting point of each net (the offset can be
0, 1 or 2). So it just staggers the patterns among different tracks
4
Fig. 7: Comparison between fixed buffer patten in conventional FPGA
and adaptive buffer insertion in mrFPGA.
in the same channel as shown in Fig. 7. When the output pin of
a logic block happens to be connected to a wire segment that is
buffered right after the connection node, just like the block “start”
in Fig. 7 connected to track3 in the upper channel, the routing result
will deviate from the optimal. The problem can be even worse during
the switch from one track to another. A timing-driven design, such
as [28], has to always buffer the switches between tracks to avoid the
potential large RC delay caused by unbuffered connection between
the two longest unbuffered track segments, in this case as large as
36ns if the two track1s in the channels are connected without a buffer
for example. It drives the routing result farther from the optimal. The
table in Fig. 7 lists all the possible delays from the logic block “start”
to the logic block “end” according to the settings in a conventional
FPGA discussed above. The first column is the track ID in the upper
channel, and the first row is the track ID in the right channel. The
buffers at the output pin of the “start” block and the input pin of
the “end” block are not counted in the delay calculation. As shown
2011 IEEE/ACM International Symposium on Nanoscale Architectures
in Fig. 7 the worst case can be 22% slower than the best case in a
conventional FPGA. On the contrary, the adaptive buffer insertion in
mrFPGA will not suffer from the deviation from the optimal buffer
placement in interconnects as does the conventional FPGA. We see
from Fig. 7 that in mrFPGA buffers are inserted just exactly one per
three blocks, resulting in a total delay of only 45ns. It is even better
than the best case in a conventional FPGA.
V. CAD F LOW AND E VALUATION OF mrFPGA
A. CAD Flow
To implement a circuit design in FPGA, a complete CAD flow is
needed. The input of the flow is the design to be implemented and the
output of the flow includes the placement and routing (P&R) result
used for programming the design into FPGA as well as an estimation
of area, delay and power consumption for the implementation of the
design in FPGA. We propose to use a timing-driven CAD flow for our
mrFPGA as shown in Fig. 8. A circuit design first goes through the
Fig. 9: Extracted equivalent circuit model of the routing structure of
mrFPGA and comparison with conventional FPGA.
it is to think about the wire segment with the length of one tile.
In mrFPGA the wire segment with the length of one tile is just the
short segment between two logic blocks, of which the length is close
to zero compared to the tile side. However the length of one tile
does not disappear. It is counted in the switch block according to the
mrFPGA architecture shown in Fig. 5b.
C. Shielding Effect of Memristor
During the development of the circuit models for the routing structure
of mrFPGA, we find out that memristor has a shielding effect to
help reduce the interconnect delay of mrFPGA further. The shielding
effect of memristors at a high resistance state (HRS) can remove the
downstream capacitance of unused paths from the total capacitance
at one node and then reduce the RC delay to reach this node. Fig. 10
shows an example of the effect in a switch block when compared
to a conventional FPGA. In the conventional FPGA, in spite of the
Fig. 8: CAD flow for mrFPGA.
design tool ABC [29] to perform logic optimization and technology
mapping. A mapped netlist consisting of K-LUTs will be obtained.
Then the mapped netlist is fed into the design tool T-VPACK [30]
to pack LUTs to logic blocks and then into the design tool mrVPR
developed by us to accomplish P&R. At last we use the generated
basic-cell (BC) netlist with delay and capacitance information to run
the power estimation tool fpgaEVA LP2 [7]. Since we try to reuse
most parts of the existing CAD flow for conventional FPGAs, the
development of mrFPGA CAD flow is quite easy. Only the P&R
tool is specifically developed for mrFPGA. We will introduce the
features of this advanced tool in the last part of this section.
(a) Conventional FPGA.
(b) mrFPGA.
Fig. 10: Memristors provide capacitance shielding from unused paths.
B. Equivalent Circuit Model for Interconnect
To perform the timing and power analysis for mrFPGA, we first
develop an equivalent circuit model for the routing structure of
mrFPGA. Fig. 9 shows the equivalent circuit of a representative
routing path from the output pin of a logic block to the input pin
of another logic block in mrFPGA and makes a comparison with
that of a conventional FPGA. In the circuit model, the MUXs in
connection blocks and switch blocks are replaced by the memristors
at low resistance state (LRS). The wire segments are still modeled
as distributed RC lines. Notice that the overall resistance value
and capacitance value of wire segments have changed due to the
reduction of the tile area in mrFPGA. Also, according to the mrFPGA
architecture in Fig. 5a, the length of a wire segment is one tile shorter
than its counterpart in a conventional FPGA. The easiest way to see
fact that the lower two paths are not used as shown in Fig. 10a, the
downstream capacitances of these two paths are still added to the total
capacitance at node A. In contrast, in mrFPGA only the downstream
capacitance of a path in use will be added to the total capacitance at
node A. The huge resistance value of memristors at HRS provides
capacitance shielding from unused paths. We use HSPICE simulation
to verify whether the resistance of memristor at HRS is high enough
to be qualified as an ideal open switch for the shielding effect. In the
mrFPGA circuit shown in Fig. 10b, we set R as 1×103 Ω, and C and
Cload as 1×10−12 F. For the actual memristor resistance value, we set
Rlrs as 1×103 Ω according to [24, 25] and sweep Rhrs from 1×103 Ω,
the same value as Rlrs , to 1 × 106 Ω. At this starting point of sweep,
the three paths are equivalent and the three downstream capacitances
2011 IEEE/ACM International Symposium on Nanoscale Architectures
5
will be all added to node A in the path from node “start” to node
“end” just like a conventional FPGA. We want to see how large Rhrs
needs to be to make the path delay from node “start” to node “end”
close enough to that in the case where Rhrs are replaced by ideal
open switches. The result obtained by the HSPICE simulations is
shown in Fig. 11. We see that with the increase of Rhrs , the delay
6
shielded by open switch
shielded by memristor
5.5
delay (ns)
5
4.5
Actual Memristor Region
4
3.5
3
10
3
10
4
5
R
hrs
(Ω)
10
10
and provide connections for logic blocks nearby in a different
way from that of a conventional FPGA shown in Fig. 12a.
2) mrVPR can reflect the memristor shielding effect. In VPR all
the device capacitances with prefabricated connections to a
node will be added to the capacitance of that node regardless
of whether the paths which these devices belong to will be
included in the routing result or not. In mrVPR, only when
a routing path is used, the capacitances on that path will be
added to the corresponding routing nodes.
3) mrVPR is integrated with the algorithm to generate the optimal
option for adaptive buffer insertion. We have implemented the
algorithm in mrVPR and it shows where buffers are needed for
insertion in the final routing result as shown in Fig. 13. We
see that the positions for buffers to insert are marked with the
black delta shape. We also see that the interconnect view of
the routing result in Fig. 13b is quite similar to that of ASIC.
We expect to see the good performance of mrFPGA in the next
section.
6
Fig. 11: The impact of the memristor shielding effect on the path
delay.
approaches to the value with Rhrs replaced by ideal open switches.
The resistance value of an actual memristor at HRS is more than two
orders of magnitude higher than that at LRS as reported in [24, 25].
We mark the actual memristor region according to this criterion and
see that the delay within this region is very close to the ideal situation.
This proves that the shielding effect of memristor can help reduce
path delay further in mrFPGA.
D. mrVPR: an advanced P&R Tool for mrFPGA
To deal with the novel routing structure of mrFPGA in the P&R step
of CAD flow, we develop an advanced tool named mrVPR (VPR for
memristor-based reconfiguration) on the base of the commonly-used
FPGA P&R tool VPR [30]. The main contributions of this tool are
(a) VPR for conventional FPGA
(b) Our mrVPR for mrFPGA
Fig. 12: Overview of routing resource in the two tools VPR and
mrVPR.
as follows:
1) mrVPR can deal with the routing graph of mrFPGA as shown
in Fig. 12b. In Fig. 12b the gray blocks are I/O blocks and
logic blocks of FPGA and the diagonal wires are the routing
paths available in switch blocks. We see that the switch blocks
and connection blocks in mrVPR are placed over logic blocks
6
(a) Routing resource view.
(b) Routing result view
Fig. 13: View of the optimal option for buffer insertion in mrVPR.
VI. E XPERIMENTAL R ESULTS
A. Settings
We choose Xilinx Virtex-6 FPGA platform as our conventional FPGA
model baseline in experiments and conduct evaluation of mrFPGA
at the same technology node. The architecture logic specification
of Virtex-6 needed by the CAD flow in Fig. 8 are collected from
Virtex-6 FPGA data sheets (Configurable Logic Block part) [26].
As for the architecture routing specification of Virtex-6, such as the
channel width and the distribution of wire segment lengths (Single,
Double, Quad, Long and Global), are extracted with the help of
FPGA Editor in Xilinx ISE environment. The RC delay values of
each type of wire segments are calculated from the unit resistance
and capacitance values in the lookup tables of ITRS2010 [31]. The
timing parameters of buffers inserted in mrFPGA routing structure,
including driving resistance, input capacitance and intrinsic delay, are
obtained via HSPICE simulation using PTM device models [32, 33].
The memristor model is extracted from the measurement results of
the memristors fabricated by the process in [24], the same process on
which mrFPGA is based. The 20 largest MCNC benchmark circuits
are used as the input of the CAD flow for conventional FPGA and
mrFPGA respectively and comparisons are made on the output of the
CAD flow from the aspects of area, delay and power.
B. Evaluation Results
Fig. 14 shows the comparison of the tile area in three FPGA
architectures: the baseline Virtex-6 FPGA, its mrFPGA counterpart
2011 IEEE/ACM International Symposium on Nanoscale Architectures
without buffer insertion (a fictitious case to show the impact of
buffer insertion) and mrFPGA with buffer insertion respectively. As
(a) Virtex-6 baseline. Total area = 13993μm2 .
(b) mrFPGA unbuffered.
(c) mrFPGA. Total area = 2547μm2 .
Fig. 14: Area comparison of a single tile in three architectures (unit:
μm2 ). (LB: logic block; CB: connection block; SB: switch block;
Buf: pre-fabricated buffers to insert on demand in mrFPGA).
discussed in the previous sections, in the case of mrFPGA unbuffered,
the area contribution of the interconnects is reduced to zero since the
interconnects are purely made up of memristors and metal layers
located on top of logic blocks. For mrFPGA with buffer insertion,
we have to count the area of pre-fabricated buffers in the channels
according to the architecture in Fig. 6. The area contributed by buffer
insertion in mrFPGA is pessimistically calculated as the case of a
uniform distribution of pre-fabricated buffers over channels, with the
number of buffers per channel set as the maximum number of buffers
required by mrVPR to insert in one channel of mrFPGA over the 20
benchmarks. We see in Fig. 14 that the area overhead brought by
pre-fabricated buffers is low compared to the interconnect area of
conventional FPGA. That is mainly due to the on-demand property
of buffer solution in mrFPGA. Results in Fig. 14 shows more than
5.5x saving of total area for mrFPGA. Table I shows the performance
comparison between Virtex-6 baseline and mrFPGA. It shows a small
performance degradation before buffer insertion in mrFPGA and a
2.28x speedup after buffer insertion. The total speedup primarily
stems from the reduction of interconnect delay. In the routing
structure of mrFPGA, the reduction of tile area by ∼5.5x results in
a shortening of wire segments in the programmable interconnects by
∼2.35 times. Then both the resistance and capacitance values of wire
segments decrease by ∼2.35 times, contributing to the total reduction
of RC delays of the wire segments by ∼5.5 times. The shielding
effect of memristor also helps to improve performance. The quadratic
increase of the RC delay of a pure RC network cancels out the benefit
in mrFPGA without buffer insertion. We see in Table I that for some
benchmarks with long interconnect paths between logic blocks, the
pure RC delay without buffers is kind of large. However in mrFPGA
with buffer insertion the interconnects perform much better than
Virtex-6 baseline and mrFPGA unbuffered. The good performance
also stems from the adaptive buffer insertion in mrFPGA. Table II
shows the power consumption by Virtex-6 basline and mrFPGA
respectively. An average of 40% power savings is achieved. The
power savings primarily come from the replacement of transistors in
the programmable interconnects, such as MUXs and SRAMs, with
the simple structure of memristors. Both dynamic power and static
power are reduced due to less capacitance on the routing paths and
the non-volatility of memristor. Notice that no die-stacking is needed
to achieve this degree of area reduction and speedup. If 3D integration
technology is introduced in the future to stack several mrFPGAs
together, at least two more times of improvements in density and
speedup, i.e. 10x and 4.5x respectively, are both expected, according
to the experimental results on 3D architecture in [17].
VII. C ONCLUSIONS
This work presents a novel FPGA architecture with memristor-based
reconfiguration named mrFPGA. The programmable interconnects of
mrFPGA use memristors and metal wires only. The routing structure
is then able to be placed over logic blocks in the same die according to
the existing memristor fabrication structure. An improved architecture
with adaptive buffer insertion is proposed to further reduce the
interconnect delay. A complete CAD flow is provided for mrFPGA
with an advanced P&R tool named mrVPR developed for mrFPGA.
The tool can deal with the novel routing structure of mrFPGA,
the memristor shielding effect, and the algorithm of optimal buffer
insertion. An evaluation of mrFPGA is done on the 20 largest MCNC
benchmark circuits. Results show that mrFPGA achieves a 5.18x area
savings, a 2.28x speedup and a 1.63x power savings.
R EFERENCES
[1] P. Schaumont and I. Verbauwhede, “Domain-specific codesign for embedded security,” Computer, vol. 36, no. 4, pp. 68–74, Apr. 2003.
[2] I. Kuon and J. Rose, “Measuring the Gap Between FPGAs and ASICs,”
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, vol. 26, no. 2, pp. 203–215, Feb. 2007.
[3] J. Cong, V. Sarkar, G. Reinman, and A. Bui, “Customizable DomainSpecific Computing,” IEEE Design and Test of Computers, vol. 28, no. 2,
pp. 6–15, Mar. 2011.
[4] V. George, “Low Energy Field-Programmable Gate Array,” Ph.D. dissertation, UC Berkeley, 2000.
[5] A. DeHon, “Reconfigurable Architectures for General-Purpose Computing,” Ph.D. dissertation, MIT, Dec. 1996.
[6] E. Ahmed and J. Rose, “The Effect of LUT and ClusterSize on DeepSubmicron FPGA Performance and Density,” IEEE Transactions on
VLSI Systems, vol. 12, no. 3, pp. 288–298, Mar. 2004.
[7] F. Li et al., “Power modeling and characteristics of field programmable
gate arrays,” IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, vol. 24, no. 11, pp. 1712–1724, Nov. 2005.
[8] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, “The
missing memristor found.” Nature, vol. 453, no. 7191, pp. 80–3, May
2008.
[9] Y. S. Chen et al., “Highly scalable hafnium oxide memory with
improvements of resistive distribution and read disturb immunity,” in
IEDM Technical Digest, Dec. 2009, pp. 1–4.
[10] C.-h. Wang et al., “Three-Dimensional 4F2 ReRAM Cell with CMOS
Logic Compatible Process,” in IEDM Technical Digest, 2010, pp. 664–
667.
[11] S.-s. Sheu et al., “A 5ns Fast Write Multi-Level Non-Volatile 1 K
bits RRAM Memory with Advance Write Scheme,” in VLSI Circuits,
Symposium on, 2009, pp. 82–83.
[12] M. Mahvash and A. Parker, “A memristor SPICE model for designing
memristor circuits,” in MWSCAS, no. 2, 2010, pp. 989–992.
[13] S. Yu, J. Liang, Y. Wu, and H.-S. P. Wong, “Read/write schemes analysis
for novel complementary resistive switches in passive crossbar memory
arrays.” Nanotechnology, vol. 21, no. 46, p. 465202, Oct. 2010.
[14] M. Liu and W. Wang, “rFPGA: CMOS-nano hybrid FPGA using RRAM
components,” in NanoArch, Jun. 2008, pp. 93–98.
[15] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, “Design Implications of
Memristor-Based RRAM Cross-Point Structures,” in DATE, 2011.
[16] G. Lemieux and D. Lewis, “Circuit design of routing switches,” International Symposium on FPGAs, p. 19, 2002.
[17] M. Lin, A. El Gamal, Y.-C. Lu, and S. Wong, “Performance Benefits of
Monolithically Stacked 3-D FPGA,” IEEE Transactions on ComputerAided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp.
681–229, Feb. 2007.
[18] Y. Chen, J. Zhao, and Y. Xie, “3D-nonFAR: Three-Dimensional NonVolatile FPGA ARchitecture Using Phase Change Memory,” in ISLPED,
2010, p. 55.
[19] C. Dong, D. Chen, S. Haruehanroengra, and W. Wang, “3-D nFPGA:
A Reconfigurable Architecture for 3-D CMOS/Nanomaterial Hybrid
Digital Circuits,” IEEE Transactions on Circuits and Systems I: Regular
Papers, vol. 54, no. 11, pp. 2489–2501, Nov. 2007.
[20] C. Chen et al., “Efficient FPGAs using nanoelectromechanical relays,”
International Symposium on FPGAs, p. 273, 2010.
[21] P.-e. Gaillardon et al., “Emerging memory technologies for reconfigurable routing in FPGA architecture,” in ICECS. IEEE, Dec. 2010, pp.
62–65.
2011 IEEE/ACM International Symposium on Nanoscale Architectures
7
TABLE I: Critical Path Delay and Comparison (unit: ns)
alu4
apex2
apex4
bigkey
clma
des
diffeq
dsip
elliptic
ex1010
ex5p
frisc
misex3
pdc
s298
s38417
s38584.1
seq
spla
tseng
average
Virtex-6 Baseline
Interconnect Delay Total Delay
6.62
9.54
7.84
10.4
6.79
9.71
10.9
13.3
12.0
14.6
15.1
17.9
4.11
5.37
9.61
10.9
5.93
8.36
14.2
16.8
7.23
10.1
8.97
9.92
6.20
9.06
11.1
14.4
8.39
11.1
8.39
11.1
8.39
11.1
7.45
10.6
10.8
13.8
4.90
7.27
-
alu4
apex2
apex4
bigkey
clma
des
diffeq
dsip
elliptic
ex1010
ex5p
frisc
misex3
pdc
s298
s38417
s38584.1
seq
spla
tseng
average
Virtex-6 Baseline
Interconnect Power Total Power
34.4
52.0
40.1
58.5
26.5
42.8
49.6
375.
75.0
139.
112.
558.
15.7
32.5
79.9
409.
52.7
157.
61.3
109.
20.6
34.7
28.1
60.5
33.4
50.4
57.3
97.1
16.0
28.6
16.0
28.6
16.0
28.6
37.1
55.6
51.1
87.1
20.4
70.8
-
mrFPGA unbuffered
Interconnect Delay
Total Delay
5.53 (1.19x)
8.76 (1.08x)
6.96 (1.12x)
10.2 (1.01x)
7.76 (0.87x)
10.3 (0.94x)
37.7 (0.29x)
40.1 (0.33x)
22.7 (0.53x)
24.3 (0.60x)
40.8 (0.37x)
43.6 (0.41x)
2.72 (1.50x)
5.09 (1.05x)
27.2 (0.35x)
28.5 (0.38x)
10.9 (0.54x)
13.0 (0.64x)
22.1 (0.64x)
25.0 (0.67x)
5.42 (1.33x)
8.34 (1.21x)
11.9 (0.75x)
13.2 (0.74x)
5.73 (1.08x)
8.65 (1.04x)
18.6 (0.59x)
22.3 (0.64x)
8.29 (1.01x)
10.1 (1.09x)
8.29 (1.01x)
10.1 (1.09x)
8.29 (1.01x)
10.1 (1.09x)
7.84 (0.95x)
10.3 (1.03x)
11.9 (0.91x)
15.1 (0.91x)
7.62 (0.64x)
9.99 (0.72x)
- (0.83x)
- (0.83x)
mrFPGA
Interconnect Delay
Total Delay
1.52 (4.35x)
4.75 (2.00x)
1.52 (5.15x)
5.12 (2.04x)
1.35 (4.99x)
4.58 (2.11x)
3.98 (2.75x)
4.97 (2.68x)
2.80 (4.31x)
5.27 (2.78x)
5.59 (2.71x)
8.02 (2.23x)
0.34 (11.7x)
3.27 (1.63x)
2.52 (3.80x)
4.89 (2.22x)
1.72 (3.43x)
4.09 (2.04x)
3.63 (3.90x)
6.92 (2.42x)
0.76 (9.43x)
4.30 (2.35x)
0.71 (12.5x)
4.87 (2.03x)
1.04 (5.91x)
4.58 (1.97x)
2.25 (4.93x)
6.22 (2.32x)
0.85 (9.86x)
4.27 (2.59x)
0.85 (9.86x)
4.27 (2.59x)
0.85 (9.86x)
4.27 (2.59x)
1.40 (5.30x)
4.63 (2.30x)
1.88 (5.77x)
5.48 (2.52x)
0.94 (5.16x)
3.31 (2.19x)
- (6.29x)
- (2.28x)
TABLE II: Power Consumption and Comparison (unit: mW)
mrFPGA unbuffered
Interconnect Power
Total Power
6.57 (5.24x)
24.1 (2.15x)
7.36 (5.44x)
25.8 (2.26x)
4.37 (6.07x)
20.6 (2.07x)
6.34 (7.83x)
332. (1.13x)
9.24 (8.12x)
73.2 (1.89x)
19.3 (5.79x)
465. (1.19x)
2.16 (7.29x)
18.9 (1.71x)
11.4 (6.97x)
341. (1.20x)
7.28 (7.23x)
111. (1.40x)
7.92 (7.74x)
56.3 (1.94x)
3.40 (6.06x)
17.4 (1.98x)
2.87 (9.76x)
35.2 (1.71x)
6.31 (5.29x)
23.2 (2.16x)
8.48 (6.75x)
48.2 (2.01x)
2.30 (6.95x)
14.9 (1.92x)
2.30 (6.95x)
14.9 (1.92x)
2.30 (6.95x)
14.9 (1.92x)
6.75 (5.49x)
25.2 (2.20x)
8.28 (6.17x)
44.2 (1.96x)
2.50 (8.19x)
52.8 (1.34x)
- (6.81x)
- (1.80x)
[22] D. B. Strukov and K. K. Likharev, “CMOL FPGA: a reconfigurable
architecture for hybrid digital circuits with two-terminal nanodevices,”
Nanotechnology, vol. 16, no. 6, pp. 888–900, Jun. 2005.
[23] P.-E. Gaillardon, F. Clermidy, I. O’Connor, and J. Liu, “Interconnection
scheme and associated mapping method of reconfigurable cell matrices
based on nanoscale devices,” in NanoArch, Jul. 2009, pp. 69–74.
[24] K. Tsunoda et al., “Low Power and High Speed Switching of Ti-doped
NiO ReRAM under the Unipolar Voltage Source of less than 3 V,” in
IEDM Technical Digest, Dec. 2007, pp. 767–770.
[25] R. Huang et al., “Resistive switching of silicon-rich-oxide featuring high
compatibility with CMOS technology for 3D stackable and embedded
applications,” Applied Physics A, pp. 927–931, Mar. 2011.
[26] Xilinx, “Virtex-6 FPGA data sheets.” [Online]. Available:
http://www.xilinx.com/support/documentation/virtex-6.htm
[27] L. van Ginneken, “Buffer placement in distributed RC-tree networks for
minimal Elmore delay,” in ISCAS, 1990, pp. 865–868.
8
mrFPGA
Interconnect Power
10.8 (3.18x)
11.4 (3.50x)
6.77 (3.92x)
12.9 (3.82x)
16.1 (4.65x)
31.5 (3.56x)
3.81 (4.13x)
20.9 (3.82x)
12.9 (4.07x)
13.6 (4.48x)
5.47 (3.77x)
5.25 (5.34x)
10.0 (3.33x)
13.8 (4.14x)
3.81 (4.20x)
3.81 (4.20x)
3.81 (4.20x)
10.9 (3.39x)
13.0 (3.90x)
4.93 (4.14x)
- (3.99x)
Total Power
28.4 (1.83x)
29.9 (1.95x)
23.0 (1.85x)
338. (1.10x)
80.1 (1.73x)
477. (1.16x)
20.6 (1.57x)
350. (1.16x)
117. (1.33x)
62.1 (1.76x)
19.5 (1.77x)
37.6 (1.60x)
26.9 (1.86x)
53.6 (1.81x)
16.4 (1.74x)
16.4 (1.74x)
16.4 (1.74x)
29.4 (1.88x)
49.0 (1.77x)
55.3 (1.28x)
- (1.63x)
[28] I. Kuon and J. Rose, “Area and delay trade-offs in the circuit and
architecture design of FPGAs,” in International Symposium on FPGAs,
2008, p. 149.
[29] “Berkeley Logic Synthesis and Verification Group, ABC: A System
for Sequential Synthesis and Verification, Release 61225.” [Online].
Available: http://www.eecs.berkeley.edu/∼alanmi/abc/
[30] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for DeepSubmicron FPGAs. Norwell: MA:Kluwer, 1999.
[31] “International
Technology
Roadmap
for
Semiconductors,”
Tech.
Rep.,
2010.
[Online].
Available:
http://www.itrs.net/Links/2010ITRS/Home2010.htm
[32] W. Zhao and Y. Cao, “Predictive Technology Model (PTM) website.”
[Online]. Available: http://ptm.asu.edu/
[33] W. Zhao and Y. Cao, “New Generation of Predictive Technology Model
for Sub-45nm Design Exploration,” in ISQED, 2006, pp. 585–590.
2011 IEEE/ACM International Symposium on Nanoscale Architectures
Download