mrFPGA: A Novel FPGA Architecture with Memristor-Based Reconfiguration
Jason Cong  Bingjun Xiao
Department of Computer Science, University of California, Los Angeles
{cong, xiao}@cs.ucla.edu

Abstract— In this paper, we introduce a novel FPGA architecture with memristor-based reconfiguration (mrFPGA). The proposed architecture is based on an existing CMOS-compatible memristor fabrication process. The programmable interconnects of mrFPGA use only memristors and metal wires, so the interconnects can be fabricated over the logic blocks, resulting in a significant reduction of overall area and interconnect delay without using a 3D die-stacking process. Using memristors to build the interconnects also provides capacitance shielding from unused routing paths, reducing interconnect delay further. Moreover, we propose an improved architecture that allows adaptive buffer insertion in the interconnects to achieve more speedup. Compared to the fixed buffer pattern in conventional FPGAs, the positions of inserted buffers in mrFPGA are optimized on demand. A complete CAD flow is provided for mrFPGA, with an advanced P&R tool named mrVPR developed for mrFPGA. The tool can deal with the novel routing structure of mrFPGA, the memristor shielding effect, and the algorithm for optimal buffer insertion. We evaluate the area, performance and power consumption of mrFPGA based on the 20 largest MCNC benchmark circuits. Results show that mrFPGA achieves 5.18x area savings, 2.28x speedup and 1.63x power savings. Further improvement is expected from the combination of 3D technologies and mrFPGA.

Keywords—FPGA; memristor; ASIC; reconfiguration

I. INTRODUCTION

The performance/power efficiency of an application implemented in an application-specific integrated circuit (ASIC) can be as much as six orders of magnitude higher than its counterpart coded for a CPU [1]. However, the rapid increase of the non-recurring engineering cost and design cycle of an ASIC in nanometer technologies makes it impractical to implement most applications in ASIC. This makes the field-programmable gate array (FPGA) increasingly popular. An FPGA is an integrated circuit that can be reconfigured to realize a large range of arbitrary functions according to customer demands. This programmability makes the FPGA a quick and reusable hardware implementation platform for specific applications. Compared to an ASIC, an FPGA sacrifices performance to some extent for programmability [2], but it can still deliver orders of magnitude improvement in performance/power efficiency over a CPU [1, 3]. This combination of flexibility and performance makes the FPGA an important component in customizable heterogeneous computing platforms [3].

It is well known that the programmable routing structure in an FPGA is the principal source of its performance inferiority compared to an ASIC [4–7]. It is reported that the programmable interconnects in an FPGA can account for up to 90% of the total area [4], up to 80% of the total delay [5, 6], and up to 85% of the total power consumption [7]. The study in [2] measures the gap between FPGAs and ASICs: the area, delay and power consumption of an FPGA are ∼21x, ∼4x and ∼12x those of an ASIC, respectively. We see that if the FPGA routing structure (the dominant part of an FPGA) gains some improvement, the gap between FPGAs and ASICs will be significantly reduced.

A conventional FPGA suffers from its programmable interconnects partly due to the extensive use of SRAM-based programming bits, multiplexers (or pass transistors), and buffers in the interconnects. One SRAM cell contains as many as six transistors but stores only one bit of data. The low density of SRAM-based storage increases the area overhead of FPGA programmability and, consequently, leads to longer routing paths and larger interconnect delay. In addition, SRAM is a type of volatile memory, which contributes to excessive power consumption during stand-by.

Emerging technologies, especially emerging non-volatile memory (NVM) technologies, create opportunities for circuit improvement. Emerging NVMs typically have a smaller cell size than SRAM. They also have the desirable property of non-volatility, which means they can be turned off during stand-by to save power. Among them, the memristor, also often known as resistive random-access memory (RRAM), is considered the most promising. Since HP Labs presented the first experimental realization of the memristor in 2008 [8], rapid progress on the fabrication of high-quality memristors has been achieved in the past few years. The current leading memristor technology outperforms many other emerging NVMs in some important aspects and performs similarly in the others. It has been demonstrated that the memristor is scalable below 30nm [9], can be fabricated with a device size as small as 4F² (F is the feature size) with full CMOS compatibility [10], and can be programmed within 5ns at the 180nm technology node [11]. Due to these device advantages over SRAM and other emerging NVMs, numerous research efforts have adopted memristors for system-level integration [12–15]. However, almost all of these studies use memristors as a new type of memory.

This paper presents a novel FPGA architecture with memristor-based reconfiguration (mrFPGA). With the novel use of memristors in mrFPGA interconnects rather than in memory cells, the proposed architecture reduces the gap between FPGAs and ASICs in area, delay and power.

978-1-4577-0994-4/11/$26.00 ©2011 IEEE

This paper is organized as follows. Section II introduces the conventional FPGA architecture and reviews recent research on FPGAs with emerging technologies.
Section III presents the unique structure of the memristor along with its electrical characteristics and circuit model. Section IV describes the architecture of mrFPGA from a high-level overview down to detailed design. Section V provides a complete CAD flow and evaluation method for mrFPGA. Section VI presents detailed evaluation results of mrFPGA using the twenty largest MCNC benchmarks. Finally, we draw conclusions in Section VII.

II. BACKGROUND AND RELATED WORK

A. Conventional FPGA

Fig. 1 shows a conventional FPGA architecture.

Fig. 1: Conventional FPGA architecture.
(a) Monolithic stacking [17]. (b) 3D integration with TSVs [18].

It is made up of an array of tiles, and each tile consists of one logic block (LB), two connection blocks (CBs) and one switch block (SB). Each logic block contains a cluster of basic logic elements (BLEs), typically lookup tables (LUTs), to provide customizable logic functions. LBs are connected to the routing channels through CBs, and the segmented routing channels are connected to each other through SBs. Fig. 1 also depicts two typical circuit designs of CBs and SBs based on multiplexers (MUXs) and buffers, as described in [16]. The selector pins of each multiplexer are connected to a group of 6-transistor SRAM cells that determine its connectivity. The circuits in Fig. 1 are replicated up to the number of pins per LB side in a single CB, and up to the number of tracks per channel in a single SB. We see that the CBs and SBs that make up the interconnects of an FPGA have much larger area and higher complexity than the direct interconnects of an ASIC.

B. Recent Work on FPGAs Using Emerging Technologies

With the recent development of emerging technologies, a number of novel FPGA architectures based on them have been proposed in the past few years. A 3D FPGA architecture was proposed in [17].
It partitions the transistors in the routing structure and logic structure of a conventional FPGA into multiple active layers and stacks the layers via monolithic stacking, a 3D integration technology, as shown in Fig. 2a. It provides a 1.7x performance gain due to the smaller tile area and shorter interconnect distance between tiles. However, 3D stacking without a proper thermal solution results in an excessive increase in heat density and a corresponding degradation of performance [19]. Applications of monolithic stacking are also currently limited, because the high temperature required to fabricate transistors in an upper layer is likely to destroy the metal wires and transistors already fabricated in the layers below [18].

In [14] and [18], two FPGA architectures are proposed using emerging NVMs and through-silicon-via (TSV) based 3D integration. Similar to the monolithic stacking in [17], they also move all the transistors in the routing structure of the FPGA to a top die over the logic-structure die, with TSV connections between them as shown in Fig. 2b. In addition, they replace the SRAM in the FPGA with emerging NVMs, such as phase-change RAM (PCRAM) [18] or memristors [14], to save the area of storing one programming bit and the stand-by power, as shown in Fig. 2c. Overall performance gains of 1.1x and 2x before and after 3D integration, respectively, are reported in these two works.

(c) Replace SRAMs with emerging NVMs [14]. (d) Nanowire crossbar [19]. Fig. 2: Recent work on FPGA using emerging technologies.

However, one potential problem is that the maximum TSV density currently achievable might fail to satisfy the dense connections required between the routing-structure die and the logic-structure die. In [19], a 3D CMOS/nanomaterial hybrid FPGA was proposed. Again, it separates the transistors of the interconnects from those of the logic blocks and redistributes them into different dies, as shown in Fig. 2d.
The difference is that it uses nanowire crossbars and face-to-face 3D integration technology to provide connections between the dies. It provides a 2.6x performance gain compared to the conventional FPGA architecture, with a certain power overhead caused by the large capacitance of the crossbar array. In addition, the crossbar structure may contain a certain percentage of defective components [19].

To summarize, these works all try to stack interconnects over logic blocks to reduce area and interconnect delay. However, they suffer from a common problem: they still require transistors in the interconnects. To achieve the stacked architectures, these works have to reorganize the interconnects and logic blocks into separate dies and introduce 3D die-stacking (or the less mature monolithic stacking). Otherwise, the high temperature during fabrication of the transistors in an upper layer would ruin the existing transistors and metal wires below in the same die. Die-stacking introduces undesirable problems in terms of manufacturability, reliability and the limit of vertical integration density.

Some researchers replace the pass transistors in the routing structure of the FPGA with emerging memory elements integrated in metal layers with back-end-of-line (BEOL) compatible fabrication [20, 21], but the improvement is limited due to the small portion of pass transistors in the routing structure. There are also studies on new reconfigurable circuit structures, such as CMOL [22] and CNTFET-based fine-grain cell matrices [23], that aim to take the place of the FPGA. Much effort is still needed to realize these revolutionary ideas in products.

III. MEMRISTOR TECHNOLOGY

In this work, we propose to use memristors and metal wires, but no transistors, to build the routing structure of mrFPGA. We first introduce memristor technology in this section.

2011 IEEE/ACM International Symposium on Nanoscale Architectures
CMOS compatibility is a key property for integrating memristors into the mrFPGA architecture, and several works have demonstrated CMOS-compatible memristor fabrication processes [10, 24, 25]. For example, Fig. 3 shows a schematic view and a cross-sectional transmission electron microscopy (TEM) image of a memristor fabricated by a CMOS-compatible process [24].

Fig. 3: Fabrication structure of a memristor integrated with CMOS in the same die [24]. (a) Schematic view. (b) TEM image.

Fig. 4: mrFPGA architecture: the programmable routing structure is built up with metal wires and memristors, which are fabricated on top of the logic blocks in the same die according to existing memristor fabrication structures.

As shown in Fig. 3a, a memristor is a two-terminal resistance-like device that can be fabricated between the top metal layer and the other metal layers above the transistor layer in the same die. As explained in [24, 25], since no high-temperature step takes place during these memristor fabrication processes, they can be embedded after the back-end-of-line process and will not ruin the transistors and metal wires already fabricated below the memristors, as shown in Fig. 3b. We also see from Fig. 3b that the structure of the memristor is quite simple, which makes a small cell size easy to achieve. Despite this simple structure, a memristor successfully stores one bit of data. A memristor can be programmed by applying a specific programming voltage or current at its two terminals to switch its resistance between a low state and a high state during normal operation. The resistance values of the two states are usually more than two orders of magnitude apart [24], or even up to six orders of magnitude apart [25]. In addition, a memristor keeps the resistance value set before power-off, without any supply voltage.
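The switch behavior just described can be sketched with a toy model. The resistance values and programming thresholds below are illustrative assumptions for this sketch (real devices are programmed with device-specific voltage or current pulses), not parameters from the paper:

```python
LRS_OHMS, HRS_OHMS = 1e3, 1e6  # assumed on/off resistances; real devices
                               # show >= 2 orders of magnitude separation [24]

class MemristorSwitch:
    """Toy model of a memristor used as a programmable two-terminal switch."""
    V_SET, V_RESET = 2.0, -2.0  # hypothetical programming thresholds (V)

    def __init__(self):
        self.resistance = HRS_OHMS  # start disconnected; the state is
                                    # non-volatile, so it survives power-off

    def program(self, pulse_volts):
        """Apply a programming pulse across the two terminals."""
        if pulse_volts >= self.V_SET:
            self.resistance = LRS_OHMS   # SET: terminals connected
        elif pulse_volts <= self.V_RESET:
            self.resistance = HRS_OHMS   # RESET: terminals open

    @property
    def connected(self):
        return self.resistance == LRS_OHMS

sw = MemristorSwitch()
sw.program(2.0)    # SET pulse -> low resistance state, terminals connected
sw.program(-2.0)   # RESET pulse -> high resistance state, terminals open
```

Only the SET/RESET state machine matters for the routing use that follows; the analog switching dynamics of a real device are deliberately ignored here.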
With this programmable property, the memristor is used as a promising type of NVM with two states: the low resistance state (LRS) and the high resistance state (HRS). What matters in our work is that a memristor acts as a programmable switch that can be reconfigured to determine whether its two terminals are connected. If the two terminals need to be connected, we program the memristor to LRS; otherwise, we program it to HRS. This mechanism is quite different from those of conventional memories and provides a unique opportunity to build a new FPGA routing structure with this novel use of memristors.

IV. mrFPGA STRUCTURE

A. Basic Structure

As discussed in the previous section, there is an opportunity to build an FPGA routing structure based only on memristors and metal wires. Such a memristor-based routing structure can be placed over the transistor layer in the same die, just like the routing structure of an ASIC. The basic schematic view of this efficient FPGA architecture with memristor-based reconfiguration (mrFPGA) is shown in Fig. 4. With the organization of logic blocks and programmable interconnects in Fig. 4, the overall mrFPGA architecture becomes a highly compact array, as shown in Fig. 5a. From the figure we see that the area of mrFPGA is reduced to the total area of the logic blocks only, which takes only 10%–20% of the conventional FPGA area [4]. Fig. 5b is a detailed design of the connection blocks and switch blocks in mrFPGA using memristors and metal wires only. In this design we use five metal layers, M5 to M9, without resource conflict. The circuits of the logic blocks in mrFPGA use the metal layers below them, i.e., M1 to M4, which are sufficient for practical FPGAs, e.g., Virtex-6 [26]. The memristor layer is designed to lie between the M8 and M9 layers.
We see that the positions of the layers are in the same order as in the memristor fabrication structure reported in [24], which guarantees the feasibility of mrFPGA fabrication. Although memristors are located close to the top layer in this design, electrical signals do not always have to traverse all metal layers to reach memristors, since the switch blocks are in M5–M9. Signals need to reach M1 or M2 from the memristor layer only when they access the pins of logic blocks. Moreover, as stated in Section I, the size of a memristor cell can be close to, or even smaller than, the width of a metal wire [9, 10], so memristors can easily fit into the design in Fig. 5b without extra area overhead. The memristor array can be programmed efficiently using the V/2 biasing scheme and the two-step write operation proposed in [15].

B. Interconnects with Adaptive Buffer Insertion

One potential problem of the basic mrFPGA routing structure is that it contains no buffers, which leads to a quadratic increase of interconnect delay with interconnect distance. This would undermine the benefit of the significant reduction in overall area and interconnect delay in the case of large FPGAs. To address this problem, we propose an improved architecture that allows adaptive buffer insertion in the interconnects, as shown in Fig. 6. A limited number of buffers are prefabricated in routing channels and can be connected to the tracks in the channels via memristors. Whether a prefabricated buffer is inserted, and which track uses it, depends on the placement result of the circuit to be implemented on mrFPGA. For a routing path long enough to exhibit the quadratic increase of RC delay, we connect buffers to the path at proper positions. To obtain the optimal insertion choice for each buffer in the routing network, we borrow an idea from the buffer placement algorithm for ASICs [27].
This algorithm decides the best positions of buffers in the interconnects of an ASIC using dynamic programming to achieve the minimum interconnect delay. It constructs delay/capacitance pairs corresponding to different buffer options in a routing tree via a depth-first search and produces the optimal option in O(B²) time, where B is the number of legal buffer positions. For mrFPGA, we insert the buffer at a given position into the routing track if the algorithm decides to place a buffer at that position.

Fig. 5: The proposed mrFPGA architecture. (a) Overview of mrFPGA, where the FPGA area is contributed by logic blocks only. (b) A detailed design of the connection blocks and switch blocks in mrFPGA using the memristor and metal-wire resources in existing memristor fabrication structures.

Fig. 6: An improved mrFPGA architecture with adaptive buffer insertion in the programmable interconnects.

Whether or not to insert a buffer at a routing node in mrFPGA is thus optimized on demand. In contrast, in a conventional FPGA the connections between buffers and routing tracks are predetermined during fabrication. A case study in Fig. 7 shows the benefit of adaptive buffer insertion in mrFPGA compared to the fixed buffer pattern of a conventional FPGA. To simplify the problem, we assume in this case that the RC delay of a wire with the length of one block is 0.5·R_wire·C_wire = 1 ns, and that the buffers in the interconnects are ideal buffers with infinite drive, no input/output capacitance, and a fixed intrinsic delay of T_buffer = 9 ns. The optimal length of a wire between two adjacent buffers would then be

k = sqrt(T_buffer / (0.5·R_wire·C_wire)) = sqrt(9 ns / 1 ns) = 3 blocks.

In a conventional FPGA, the pattern of prefabricated buffers in the interconnects follows this result and distributes buffers evenly with a distance of three blocks (as shown in Fig. 7).
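The dynamic-programming idea borrowed from [27] can be sketched for a single routing path. This is a simplified, path-only illustration of the candidate-pruning scheme, not the paper's implementation; the function name and its parameterization are our own:

```python
def buffer_insertion(n_units, r, c, sites, Rb, Cb, Tb, Rd, Cl):
    """Minimum Elmore delay over buffer choices on a path of n_units wire units.

    r, c: resistance/capacitance per wire unit; Rb, Cb, Tb: buffer drive
    resistance, input capacitance, intrinsic delay; Rd: driver resistance;
    Cl: sink load. sites: unit indices (counted from the sink) where a
    buffer may be inserted. Candidates are (downstream cap, delay to sink,
    buffer positions); dominated candidates are pruned, which keeps the
    list small (O(B^2) work overall, as in [27])."""
    cands = [(Cl, 0.0, ())]
    for i in range(1, n_units + 1):
        # extend each candidate by one wire unit toward the source
        cands = [(C + c, D + r * (c / 2 + C), b) for (C, D, b) in cands]
        if i in sites:  # optionally decouple the downstream load here
            cands += [(Cb, D + Tb + Rb * C, b + (i,)) for (C, D, b) in cands]
        cands.sort()
        pruned, best = [], float("inf")
        for C, D, b in cands:  # keep only the Pareto front in (cap, delay)
            if D < best:
                pruned.append((C, D, b))
                best = D
        cands = pruned
    delay, positions = min((Rd * C + D, b) for (C, D, b) in cands)
    return delay, positions

# Case-study numbers: 0.5*r*c = 1 ns per block (r = 1, c = 2), ideal buffers
# (Rb = Cb = 0, Tb = 9 ns), ideal driver and load (Rd = Cl = 0).
best, where = buffer_insertion(9, 1.0, 2.0, set(range(1, 9)),
                               0.0, 0.0, 9.0, 0.0, 0.0)
# best == 45.0: three 3-block segments (9 ns of wire each) plus two 9 ns
# buffers, i.e. the one-buffer-per-three-blocks optimum.
```

With no buffer sites at all, the same function reproduces the quadratic unbuffered delays (a 6-unit path gives 36 ns, a 9-unit path 81 ns), matching the case-study figures.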
The problem is that the fixed pattern does not know the starting point of each net (the offset can be 0, 1 or 2), so it simply staggers the patterns among the different tracks in the same channel, as shown in Fig. 7.

Fig. 7: Comparison between the fixed buffer pattern in a conventional FPGA and adaptive buffer insertion in mrFPGA.

When the output pin of a logic block happens to be connected to a wire segment that is buffered right after the connection node, like the block "start" in Fig. 7 connected to track 3 in the upper channel, the routing result deviates from the optimum. The problem can be even worse when switching from one track to another. A timing-driven design, such as [28], always has to buffer the switches between tracks to avoid the potentially large RC delay caused by an unbuffered connection between the two longest unbuffered track segments; in this case the delay would be as large as 36 ns if, for example, the two track-1 segments in the channels were connected without a buffer. This drives the routing result farther from the optimum. The table in Fig. 7 lists all the possible delays from the logic block "start" to the logic block "end" under the conventional-FPGA settings discussed above. The first column is the track ID in the upper channel, and the first row is the track ID in the right channel. The buffers at the output pin of the "start" block and the input pin of the "end" block are not counted in the delay calculation. As shown in Fig. 7, the worst case can be 22% slower than the best case in a conventional FPGA.

In contrast, the adaptive buffer insertion in mrFPGA does not suffer from this deviation from the optimal buffer placement in the interconnects. We see from Fig. 7 that in mrFPGA exactly one buffer is inserted every three blocks, resulting in a total delay of only 45 ns, which is even better than the best case in a conventional FPGA.

V. CAD FLOW AND EVALUATION OF mrFPGA
A. CAD Flow

To implement a circuit design in an FPGA, a complete CAD flow is needed. The input of the flow is the design to be implemented; the output includes the placement and routing (P&R) result used to program the design into the FPGA, as well as an estimation of the area, delay and power consumption of the implementation. We propose a timing-driven CAD flow for mrFPGA, as shown in Fig. 8.

Fig. 8: CAD flow for mrFPGA.

A circuit design first goes through the design tool ABC [29] for logic optimization and technology mapping, which produces a mapped netlist consisting of K-LUTs. The mapped netlist is then fed into the design tool T-VPACK [30] to pack LUTs into logic blocks, and then into our tool mrVPR to accomplish P&R. Finally, we use the generated basic-cell (BC) netlist, with its delay and capacitance information, to run the power estimation tool fpgaEVA-LP2 [7]. Since we reuse most parts of the existing CAD flow for conventional FPGAs, the development of the mrFPGA CAD flow is quite easy; only the P&R tool is developed specifically for mrFPGA. We introduce the features of this advanced tool in the last part of this section.

B. Equivalent Circuit Model for Interconnect

To perform timing and power analysis for mrFPGA, we first develop an equivalent circuit model for its routing structure. Fig. 9 shows the equivalent circuit of a representative routing path from the output pin of one logic block to the input pin of another in mrFPGA, and compares it with that of a conventional FPGA.

Fig. 9: Extracted equivalent circuit model of the routing structure of mrFPGA and comparison with a conventional FPGA.

In the circuit model, the MUXs in the connection blocks and switch blocks are replaced by memristors at the low resistance state (LRS). The wire segments are still modeled as distributed RC lines. Notice that the overall resistance and capacitance values of the wire segments have changed due to the reduction of the tile area in mrFPGA. Also, according to the mrFPGA architecture in Fig. 5a, the length of a wire segment is one tile shorter than its counterpart in a conventional FPGA. The easiest way to see this is to consider the wire segment with the length of one tile. In mrFPGA, that wire segment is just the short segment between two logic blocks, whose length is close to zero compared to the tile side. However, the length of one tile does not disappear: it is counted in the switch block, according to the mrFPGA architecture shown in Fig. 5b.

C. Shielding Effect of Memristor

During the development of the circuit models for the routing structure of mrFPGA, we found that the memristor has a shielding effect that helps reduce the interconnect delay of mrFPGA further. The shielding effect of memristors at the high resistance state (HRS) can remove the downstream capacitance of unused paths from the total capacitance at a node and thereby reduce the RC delay to reach that node. Fig. 10 shows an example of the effect in a switch block, compared to a conventional FPGA.

Fig. 10: Memristors provide capacitance shielding from unused paths. (a) Conventional FPGA. (b) mrFPGA.

In the conventional FPGA, in spite of the fact that the lower two paths are not used, as shown in Fig. 10a, the downstream capacitances of these two paths are still added to the total capacitance at node A. In contrast, in mrFPGA only the downstream capacitance of a path in use is added to the total capacitance at node A. The huge resistance of memristors at HRS provides capacitance shielding from the unused paths. We use HSPICE simulation to verify whether the resistance of a memristor at HRS is high enough to qualify as an ideal open switch for the shielding effect.
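The trend of this verification can also be reproduced outside HSPICE with a small transient simulation. The topology below is a simplified stand-in for the Fig. 10b circuit (one used path, two unused paths hanging off node A), using the component values given in the text; the small parasitic capacitance Ca at node A and the forward-Euler integration are our own assumptions, added only to keep the sketch self-contained:

```python
def rise_delay(Rhrs, R=1e3, Rlrs=1e3, C=1e-12, Cload=1e-12,
               Ca=0.1e-12, dt=1e-12, tmax=50e-9):
    """50%-rise delay (s) at 'out' for a unit voltage step at 'in'.

    Simplified stand-in for the Fig. 10b topology:
        in --R--> A --Rlrs--> out (Cload)        # the path in use
        A --Rhrs--> b1 (C),  A --Rhrs--> b2 (C)  # two unused paths
    Ca is an assumed small parasitic at node A (not from the paper),
    added so the explicit forward-Euler integration is well-posed."""
    va = vout = vb1 = vb2 = 0.0
    t = 0.0
    while t < tmax:
        d_va = ((1.0 - va) / R - (va - vout) / Rlrs
                - (va - vb1) / Rhrs - (va - vb2) / Rhrs) / Ca
        d_vout = (va - vout) / (Rlrs * Cload)
        d_vb1 = (va - vb1) / (Rhrs * C)
        d_vb2 = (va - vb2) / (Rhrs * C)
        va, vout = va + dt * d_va, vout + dt * d_vout
        vb1, vb2 = vb1 + dt * d_vb1, vb2 + dt * d_vb2
        t += dt
        if vout >= 0.5:
            return t
    return tmax

# Sweeping Rhrs from Rlrs (no shielding) toward an open switch shows the
# delay dropping, qualitatively matching the HSPICE sweep of Fig. 11.
unshielded = rise_delay(Rhrs=1e3)   # all downstream caps load node A
shielded = rise_delay(Rhrs=1e6)     # HRS memristors shield the unused caps
```

The shielded delay comes out clearly smaller than the unshielded one; the absolute numbers depend on the assumed Ca and simplified topology, so only the trend should be compared with Fig. 11.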
In the mrFPGA circuit shown in Fig. 10b, we set R to 1×10³ Ω, and C and Cload to 1×10⁻¹² F. For the actual memristor resistance, we set Rlrs to 1×10³ Ω according to [24, 25] and sweep Rhrs from 1×10³ Ω (the same value as Rlrs) to 1×10⁶ Ω. At the starting point of the sweep, the three paths are equivalent and the three downstream capacitances will all be added to node A in the path from node "start" to node "end", just as in a conventional FPGA. We want to see how large Rhrs needs to be to make the path delay from node "start" to node "end" close enough to that in the case where Rhrs is replaced by an ideal open switch. The result obtained by the HSPICE simulations is shown in Fig. 11: as Rhrs increases, the delay approaches the value with Rhrs replaced by ideal open switches.

Fig. 11: The impact of the memristor shielding effect on the path delay.

The resistance of an actual memristor at HRS is more than two orders of magnitude higher than that at LRS, as reported in [24, 25]. We mark the actual memristor region in Fig. 11 according to this criterion and see that the delay within this region is very close to the ideal situation. This proves that the shielding effect of memristors can help reduce the path delay further in mrFPGA.

D. mrVPR: An Advanced P&R Tool for mrFPGA

To deal with the novel routing structure of mrFPGA in the P&R step of the CAD flow, we develop an advanced tool named mrVPR (VPR for memristor-based reconfiguration) based on the commonly used FPGA P&R tool VPR [30].

Fig. 12: Overview of routing resources in the two tools VPR and mrVPR. (a) VPR for a conventional FPGA. (b) Our mrVPR for mrFPGA.

The main contributions of this tool are as follows: 1) mrVPR can deal with the routing graph of mrFPGA as shown in Fig. 12b, where the gray blocks are the I/O blocks and logic blocks of the FPGA and the diagonal wires are the routing paths available in switch blocks. The switch blocks and connection blocks in mrVPR are placed over the logic blocks and provide connections for nearby logic blocks in a different way from that of the conventional FPGA shown in Fig. 12a. 2) mrVPR can reflect the memristor shielding effect. In VPR, all the device capacitances with prefabricated connections to a node are added to the capacitance of that node, regardless of whether the paths these devices belong to are included in the routing result. In mrVPR, the capacitances on a routing path are added to the corresponding routing nodes only when that path is used. 3) mrVPR is integrated with the algorithm that generates the optimal option for adaptive buffer insertion. We have implemented the algorithm in mrVPR, and it shows where buffers need to be inserted in the final routing result, as shown in Fig. 13; the positions for buffer insertion are marked with black delta shapes. We also see that the interconnect view of the routing result in Fig. 13b is quite similar to that of an ASIC. We expect to see good performance from mrFPGA in the next section.

Fig. 13: View of the optimal option for buffer insertion in mrVPR. (a) Routing resource view. (b) Routing result view.

VI. EXPERIMENTAL RESULTS

A. Settings

We choose the Xilinx Virtex-6 FPGA platform as the conventional FPGA baseline in our experiments and evaluate mrFPGA at the same technology node. The logic architecture specification of Virtex-6 needed by the CAD flow in Fig. 8 is collected from the Virtex-6 FPGA data sheets (Configurable Logic Block section) [26]. The routing architecture specification of Virtex-6, such as the channel width and the distribution of wire segment lengths (Single, Double, Quad, Long and Global), is extracted with the help of the FPGA Editor in the Xilinx ISE environment.
The RC delay values of each type of wire segment are calculated from the unit resistance and capacitance values in the lookup tables of ITRS 2010 [31]. The timing parameters of the buffers inserted in the mrFPGA routing structure, including driving resistance, input capacitance and intrinsic delay, are obtained via HSPICE simulation using PTM device models [32, 33]. The memristor model is extracted from the measurement results of memristors fabricated by the process in [24], the same process on which mrFPGA is based. The 20 largest MCNC benchmark circuits are used as input to the CAD flow for the conventional FPGA and for mrFPGA, and comparisons are made on the output of the CAD flow in terms of area, delay and power.

B. Evaluation Results

Fig. 14 shows the comparison of tile area in three FPGA architectures: the baseline Virtex-6 FPGA, its mrFPGA counterpart without buffer insertion (a fictitious case to show the impact of buffer insertion), and mrFPGA with buffer insertion.

Fig. 14: Area comparison of a single tile in three architectures (unit: μm²). (a) Virtex-6 baseline, total area = 13993 μm². (b) mrFPGA unbuffered. (c) mrFPGA, total area = 2547 μm². (LB: logic block; CB: connection block; SB: switch block; Buf: prefabricated buffers to insert on demand in mrFPGA.)

As discussed in the previous sections, in the case of unbuffered mrFPGA the area contribution of the interconnects is reduced to zero, since the interconnects are made purely of memristors and metal layers located on top of the logic blocks. For mrFPGA with buffer insertion, we have to count the area of the prefabricated buffers in the channels according to the architecture in Fig. 6.
The area contributed by buffer insertion in mrFPGA is pessimistically calculated for the case of a uniform distribution of prefabricated buffers over the channels, with the number of buffers per channel set to the maximum number of buffers mrVPR needed to insert in any one channel of mrFPGA over the 20 benchmarks. We see in Fig. 14 that the area overhead brought by the prefabricated buffers is low compared to the interconnect area of a conventional FPGA, mainly due to the on-demand nature of the buffer solution in mrFPGA. The results in Fig. 14 show more than a 5.5x saving in total area for mrFPGA.

Table I shows the performance comparison between the Virtex-6 baseline and mrFPGA. It shows a small performance degradation before buffer insertion in mrFPGA and a 2.28x speedup after buffer insertion. The total speedup primarily stems from the reduction of interconnect delay. In the routing structure of mrFPGA, the reduction of tile area by ∼5.5x shortens the wire segments in the programmable interconnects by ∼2.35x. Both the resistance and capacitance of the wire segments then decrease by ∼2.35x, contributing a total reduction of the RC delays of the wire segments by ∼5.5x. The shielding effect of the memristor also helps to improve performance. In mrFPGA without buffer insertion, however, the quadratic increase of the RC delay of a pure RC network cancels out this benefit; we see in Table I that for some benchmarks with long interconnect paths between logic blocks, the pure RC delay without buffers is quite large. With buffer insertion, the interconnects of mrFPGA perform much better than both the Virtex-6 baseline and unbuffered mrFPGA; this good performance also stems from the adaptive buffer insertion in mrFPGA.
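The chain of scalings above (tile area → wire length → RC delay) follows from wire length scaling with the tile side; a quick arithmetic check using the tile areas reported in Fig. 14:

```python
import math

area_saving = 13993 / 2547           # tile-area saving from Fig. 14, ~5.5x
wire_scale = math.sqrt(area_saving)  # wire length shrinks with the tile
                                     # side, i.e. by ~2.35x
rc_scale = wire_scale ** 2           # R and C each scale with wire length,
                                     # so the RC product regains the ~5.5x
```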
The power savings come primarily from replacing the transistor-based components of the programmable interconnects, such as the MUXes and SRAM cells, with the simple memristor structure. Both dynamic and static power are reduced, owing to the lower capacitance on the routing paths and to the non-volatility of memristors, respectively. Notice that no die stacking is needed to achieve this degree of area reduction and speedup. If 3D integration technology is introduced in the future to stack several mrFPGAs together, at least another 2x improvement in both density and speed, i.e. 10x area savings and 4.5x speedup overall, is expected, according to the experimental results on 3D FPGA architecture in [17].

VII. CONCLUSIONS

This work presents a novel FPGA architecture with memristor-based reconfiguration named mrFPGA. The programmable interconnects of mrFPGA use only memristors and metal wires, so the routing structure can be fabricated over the logic blocks in the same die based on the existing memristor fabrication process. An improved architecture with adaptive buffer insertion is proposed to further reduce interconnect delay. A complete CAD flow is provided for mrFPGA, with an advanced P&R tool named mrVPR developed for mrFPGA. The tool handles the novel routing structure of mrFPGA, the memristor shielding effect, and the algorithm for optimal buffer insertion. mrFPGA is evaluated on the 20 largest MCNC benchmark circuits. Results show that mrFPGA achieves 5.18x area savings, 2.28x speedup, and 1.63x power savings.

REFERENCES

[1] P. Schaumont and I. Verbauwhede, "Domain-specific codesign for embedded security," Computer, vol. 36, no. 4, pp. 68–74, Apr. 2003.
[2] I. Kuon and J. Rose, "Measuring the Gap Between FPGAs and ASICs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp. 203–215, Feb. 2007.
[3] J. Cong, V. Sarkar, G. Reinman, and A. Bui, "Customizable Domain-Specific Computing," IEEE Design and Test of Computers, vol. 28, no.
2, pp. 6–15, Mar. 2011.
[4] V. George, "Low Energy Field-Programmable Gate Array," Ph.D. dissertation, UC Berkeley, 2000.
[5] A. DeHon, "Reconfigurable Architectures for General-Purpose Computing," Ph.D. dissertation, MIT, Dec. 1996.
[6] E. Ahmed and J. Rose, "The Effect of LUT and Cluster Size on Deep-Submicron FPGA Performance and Density," IEEE Transactions on VLSI Systems, vol. 12, no. 3, pp. 288–298, Mar. 2004.
[7] F. Li et al., "Power modeling and characteristics of field programmable gate arrays," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 11, pp. 1712–1724, Nov. 2005.
[8] D. B. Strukov, G. S. Snider, D. R. Stewart, and R. S. Williams, "The missing memristor found," Nature, vol. 453, no. 7191, pp. 80–83, May 2008.
[9] Y. S. Chen et al., "Highly scalable hafnium oxide memory with improvements of resistive distribution and read disturb immunity," in IEDM Technical Digest, Dec. 2009, pp. 1–4.
[10] C.-H. Wang et al., "Three-Dimensional 4F2 ReRAM Cell with CMOS Logic Compatible Process," in IEDM Technical Digest, 2010, pp. 664–667.
[11] S.-S. Sheu et al., "A 5ns Fast Write Multi-Level Non-Volatile 1 K bits RRAM Memory with Advance Write Scheme," in Symposium on VLSI Circuits, 2009, pp. 82–83.
[12] M. Mahvash and A. Parker, "A memristor SPICE model for designing memristor circuits," in MWSCAS, 2010, pp. 989–992.
[13] S. Yu, J. Liang, Y. Wu, and H.-S. P. Wong, "Read/write schemes analysis for novel complementary resistive switches in passive crossbar memory arrays," Nanotechnology, vol. 21, no. 46, p. 465202, Oct. 2010.
[14] M. Liu and W. Wang, "rFPGA: CMOS-nano hybrid FPGA using RRAM components," in NanoArch, Jun. 2008, pp. 93–98.
[15] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie, "Design Implications of Memristor-Based RRAM Cross-Point Structures," in DATE, 2011.
[16] G. Lemieux and D. Lewis, "Circuit design of routing switches," in International Symposium on FPGAs, 2002, p. 19.
[17] M. Lin, A. El Gamal, Y.-C.
Lu, and S. Wong, "Performance Benefits of Monolithically Stacked 3-D FPGA," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, pp. 216–229, Feb. 2007.
[18] Y. Chen, J. Zhao, and Y. Xie, "3D-nonFAR: Three-Dimensional Non-Volatile FPGA ARchitecture Using Phase Change Memory," in ISLPED, 2010, p. 55.
[19] C. Dong, D. Chen, S. Haruehanroengra, and W. Wang, "3-D nFPGA: A Reconfigurable Architecture for 3-D CMOS/Nanomaterial Hybrid Digital Circuits," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 54, no. 11, pp. 2489–2501, Nov. 2007.
[20] C. Chen et al., "Efficient FPGAs using nanoelectromechanical relays," in International Symposium on FPGAs, 2010, p. 273.
[21] P.-E. Gaillardon et al., "Emerging memory technologies for reconfigurable routing in FPGA architecture," in ICECS, Dec. 2010, pp. 62–65.

TABLE I: Critical Path Delay and Comparison (unit: ns)

Circuit      Virtex-6 Baseline      mrFPGA unbuffered              mrFPGA
             Interconn.  Total      Interconn.     Total           Interconn.     Total
alu4         6.62        9.54       5.53 (1.19x)   8.76 (1.08x)    1.52 (4.35x)   4.75 (2.00x)
apex2        7.84        10.4       6.96 (1.12x)   10.2 (1.01x)    1.52 (5.15x)   5.12 (2.04x)
apex4        6.79        9.71       7.76 (0.87x)   10.3 (0.94x)    1.35 (4.99x)   4.58 (2.11x)
bigkey       10.9        13.3       37.7 (0.29x)   40.1 (0.33x)    3.98 (2.75x)   4.97 (2.68x)
clma         12.0        14.6       22.7 (0.53x)   24.3 (0.60x)    2.80 (4.31x)   5.27 (2.78x)
des          15.1        17.9       40.8 (0.37x)   43.6 (0.41x)    5.59 (2.71x)   8.02 (2.23x)
diffeq       4.11        5.37       2.72 (1.50x)   5.09 (1.05x)    0.34 (11.7x)   3.27 (1.63x)
dsip         9.61        10.9       27.2 (0.35x)   28.5 (0.38x)    2.52 (3.80x)   4.89 (2.22x)
elliptic     5.93        8.36       10.9 (0.54x)   13.0 (0.64x)    1.72 (3.43x)   4.09 (2.04x)
ex1010       14.2        16.8       22.1 (0.64x)   25.0 (0.67x)    3.63 (3.90x)   6.92 (2.42x)
ex5p         7.23        10.1       5.42 (1.33x)   8.34 (1.21x)    0.76 (9.43x)   4.30 (2.35x)
frisc        8.97        9.92       11.9 (0.75x)   13.2 (0.74x)    0.71 (12.5x)   4.87 (2.03x)
misex3       6.20        9.06       5.73 (1.08x)   8.65 (1.04x)    1.04 (5.91x)   4.58 (1.97x)
pdc          11.1        14.4       18.6 (0.59x)   22.3 (0.64x)    2.25 (4.93x)   6.22 (2.32x)
s298         8.39        11.1       8.29 (1.01x)   10.1 (1.09x)    0.85 (9.86x)   4.27 (2.59x)
s38417       8.39        11.1       8.29 (1.01x)   10.1 (1.09x)    0.85 (9.86x)   4.27 (2.59x)
s38584.1     8.39        11.1       8.29 (1.01x)   10.1 (1.09x)    0.85 (9.86x)   4.27 (2.59x)
seq          7.45        10.6       7.84 (0.95x)   10.3 (1.03x)    1.40 (5.30x)   4.63 (2.30x)
spla         10.8        13.8       11.9 (0.91x)   15.1 (0.91x)    1.88 (5.77x)   5.48 (2.52x)
tseng        4.90        7.27       7.62 (0.64x)   9.99 (0.72x)    0.94 (5.16x)   3.31 (2.19x)
average      -           -          -    (0.83x)   -    (0.83x)    -    (6.29x)   -    (2.28x)

TABLE II: Power Consumption and Comparison (unit: mW)

Circuit      Virtex-6 Baseline      mrFPGA unbuffered              mrFPGA
             Interconn.  Total      Interconn.     Total           Interconn.     Total
alu4         34.4        52.0       6.57 (5.24x)   24.1 (2.15x)    10.8 (3.18x)   28.4 (1.83x)
apex2        40.1        58.5       7.36 (5.44x)   25.8 (2.26x)    11.4 (3.50x)   29.9 (1.95x)
apex4        26.5        42.8       4.37 (6.07x)   20.6 (2.07x)    6.77 (3.92x)   23.0 (1.85x)
bigkey       49.6        375.       6.34 (7.83x)   332. (1.13x)    12.9 (3.82x)   338. (1.10x)
clma         75.0        139.       9.24 (8.12x)   73.2 (1.89x)    16.1 (4.65x)   80.1 (1.73x)
des          112.        558.       19.3 (5.79x)   465. (1.19x)    31.5 (3.56x)   477. (1.16x)
diffeq       15.7        32.5       2.16 (7.29x)   18.9 (1.71x)    3.81 (4.13x)   20.6 (1.57x)
dsip         79.9        409.       11.4 (6.97x)   341. (1.20x)    20.9 (3.82x)   350. (1.16x)
elliptic     52.7        157.       7.28 (7.23x)   111. (1.40x)    12.9 (4.07x)   117. (1.33x)
ex1010       61.3        109.       7.92 (7.74x)   56.3 (1.94x)    13.6 (4.48x)   62.1 (1.76x)
ex5p         20.6        34.7       3.40 (6.06x)   17.4 (1.98x)    5.47 (3.77x)   19.5 (1.77x)
frisc        28.1        60.5       2.87 (9.76x)   35.2 (1.71x)    5.25 (5.34x)   37.6 (1.60x)
misex3       33.4        50.4       6.31 (5.29x)   23.2 (2.16x)    10.0 (3.33x)   26.9 (1.86x)
pdc          57.3        97.1       8.48 (6.75x)   48.2 (2.01x)    13.8 (4.14x)   53.6 (1.81x)
s298         16.0        28.6       2.30 (6.95x)   14.9 (1.92x)    3.81 (4.20x)   16.4 (1.74x)
s38417       16.0        28.6       2.30 (6.95x)   14.9 (1.92x)    3.81 (4.20x)   16.4 (1.74x)
s38584.1     16.0        28.6       2.30 (6.95x)   14.9 (1.92x)    3.81 (4.20x)   16.4 (1.74x)
seq          37.1        55.6       6.75 (5.49x)   25.2 (2.20x)    10.9 (3.39x)   29.4 (1.88x)
spla         51.1        87.1       8.28 (6.17x)   44.2 (1.96x)    13.0 (3.90x)   49.0 (1.77x)
tseng        20.4        70.8       2.50 (8.19x)   52.8 (1.34x)    4.93 (4.14x)   55.3 (1.28x)
average      -           -          -    (6.81x)   -    (1.80x)    -    (3.99x)   -    (1.63x)

[22] D. B. Strukov and K. K. Likharev, "CMOL FPGA: a reconfigurable architecture for hybrid digital circuits with two-terminal nanodevices," Nanotechnology, vol. 16, no. 6, pp. 888–900, Jun. 2005.
[23] P.-E. Gaillardon, F. Clermidy, I. O'Connor, and J. Liu, "Interconnection scheme and associated mapping method of reconfigurable cell matrices based on nanoscale devices," in NanoArch, Jul. 2009, pp. 69–74.
[24] K. Tsunoda et al., "Low Power and High Speed Switching of Ti-doped NiO ReRAM under the Unipolar Voltage Source of less than 3 V," in IEDM Technical Digest, Dec. 2007, pp. 767–770.
[25] R. Huang et al., "Resistive switching of silicon-rich-oxide featuring high compatibility with CMOS technology for 3D stackable and embedded applications," Applied Physics A, pp. 927–931, Mar. 2011.
[26] Xilinx, "Virtex-6 FPGA data sheets." [Online]. Available: http://www.xilinx.com/support/documentation/virtex-6.htm
[27] L. van Ginneken, "Buffer placement in distributed RC-tree networks for minimal Elmore delay," in ISCAS, 1990, pp. 865–868.
[28] I. Kuon and J. Rose, "Area and delay trade-offs in the circuit and architecture design of FPGAs," in International Symposium on FPGAs, 2008, p. 149.
[29] "Berkeley Logic Synthesis and Verification Group, ABC: A System for Sequential Synthesis and Verification, Release 61225." [Online]. Available: http://www.eecs.berkeley.edu/~alanmi/abc/
[30] V. Betz, J. Rose, and A.
Marquardt, Architecture and CAD for Deep-Submicron FPGAs. Norwell, MA: Kluwer, 1999.
[31] "International Technology Roadmap for Semiconductors," Tech. Rep., 2010. [Online]. Available: http://www.itrs.net/Links/2010ITRS/Home2010.htm
[32] W. Zhao and Y. Cao, "Predictive Technology Model (PTM) website." [Online]. Available: http://ptm.asu.edu/
[33] W. Zhao and Y. Cao, "New Generation of Predictive Technology Model for Sub-45nm Design Exploration," in ISQED, 2006, pp. 585–590.

2011 IEEE/ACM International Symposium on Nanoscale Architectures