An Asynchronous Self-Routing Adaptive Reconfigurable FPGA

M. Ferranti and A. Lodi
DEIS, 2, Via Risorgimento, 40136 Bologna, ITALY
{mferranti,anlodi}@deis.unibo.it, http://www.micro.deis.unibo.it

Analysis: Reconfigurable devices consist of logic blocks interspersed with multi-context memory. Architecture: We investigate an architecture that places all memory information on one side of the array and leaves computation on the other, to achieve speed and compactness. The array consists of a column of CLBs interconnected by a self-routing network. The network (Omega) routes data packets on the basis of their content. A 32-CLB array measures 1.8K x 15K sq. lambda. With asynchronous signalling, logic blocks are reconfigured as soon as their operands are ready. The delay across one stage of the network is 250 ps (2 lambda = 0.25 um), with no need for a clock above 1 GHz to synchronize the stages across the system. Low-power adaptive circuits were designed to maximize wire bandwidth. In this way, both logic blocks and wiring resources are reused over time, reaching optimal hardware exploitation and task independence. Mapping Results: A few benchmark circuits mapped onto our architecture show comparable speed even when the reference is a single-context device of minimal area for that circuit. In an SOC environment with fixed-pin constraints this advantage is considerably larger.

A Comparison of FPGA Implementations of Two’s Complement Bit-Level and Word-Level Matrix Multipliers

Radhika S. Grover, Weijia Shang, and Qiang Li
Computer Engineering Department, Santa Clara University, Santa Clara, CA.
rgrover@scudc.scu.edu, {wshang,qli}@sunrise.scu.edu

We present a novel bit-level architecture that differs from other architectures in that the individual bits of a word do not have to be processed as a unit.
For example, in matrix multiplication, the carry chain in computing the product of two numbers is broken by sending the partial sums and carries of the product to the accumulating operation instead of the whole finished word. In contrast, in a word-level matrix multiplier the product of two words has to be computed completely before it can be added to the next word-level product. Bit-level architectures for fixed-point matrix multiplication are proven to be O(log p) times faster than the fastest word-level architecture, where p is the word length. In this paper, we compare our implementation of a bit-level two’s complement matrix multiplier with a word-level matrix multiplier composed of library macros. Our results show a speedup by a factor of two or more.

Corner Turning Interconnect for an FPGA: Motivation and Routing

Nicholas Weaver and John Wawrzynek
http://www.cs.berkeley.edu/~nweaver/sfra/

We believe it is important for computation-oriented FPGAs to include pipelined interconnect structures while maintaining the well-understood Manhattan placement properties. In order to develop such an array, we have devised an alternative switchbox topology: a “corner turn” network. Unlike a conventional switchbox-based network, our topology allows for efficiently pipelinable switches, encoded switch points, and fast, polynomial-time routing. We have developed the interconnect topology and a global routing tool, and have compared its results with those produced by VPR for a comparable conventional topology. Our router generates routes of nearly optimal length, in polynomial time, at the cost of a greater number of wire channels.

Decoder-Driven Switching Matrices in Multicontext FPGAs: Area, Routability and Speed

V. Baena-Lecuyer, M. A. Aguirre, A. Torralba and L. G. Franquelo
baena@gte.esi.us.es

Modern FPGAs use SRAM cells to store the programming bits that drive the switching matrices.
The area of these SRAM cells and the programmable interconnect can be as large as 90% of the total area. This figure increases dramatically in multicontext FPGAs, where the programming configuration has to be stored once per context. The use of decoders or multiplexers at the logic block inputs is widely accepted as an attractive way of saving area without reducing routability. This paper introduces a first-order area model, expressed in terms of standard FPGA parameters, which takes into account the use of decoders to connect logic block output pins to the routing channels. Experimental results are presented, showing that the use of decoders at the logic block outputs in some multicontext FPGAs saves more than 15% of area while preserving routability.

Design and Implementation of a Variable Length Packet Switch Board

Chan Kim, Ji-Myung Rho, Tae-Whan Yoo and Jong-Hyun Lee
ETRI, 161 Kajeong-dong, Yuseong-goo, Taejon, 305-350 Korea
ckim@etri.re.kr, http://home.hanmir.com/~ckim6803

A switch board for ATM or variable-length packet switching was implemented using a shared-memory switch fabric ASIC and Xilinx FPGAs. The switch ASIC provides 20 Gbps throughput for 32 ports at 622 Mbps using a parallel configuration, and supports multicast, QoS scheduling, sophisticated back-pressure, and channel grouping. The FPGAs constitute the first and second stages of the switch; they convert packets from the backplane into parallel units, and the switched units from the switch ASIC back into packets, using buffering, multiplexing and de-multiplexing, format conversion and synchronization mechanisms. The board runs on a 100 MHz clock, and line drivers were eliminated by directly using the FPGAs’ GTLP buffers. The board also has a protection switching mechanism.
The architecture was devised to reduce latency and buffer space requirements; with it, the number of components was drastically reduced and performance increased compared with the former design.

Evaluation of Novel FPGA Features for Automotive Multimedia Applications

Karl G. Esser, Carsten Oetker, Karlheinz Weiss and Wolfgang Rosenstiel
Forschungszentrum Informatik (FZI) Karlsruhe, Germany

In the past, embedded systems in the automotive area were mainly used for control applications, where extremely high safety requirements prohibit the use of FPGAs. We show that FPGAs have a high potential for use in future car multimedia systems, which combine the challenges of the automotive and consumer markets. We exploit the advantages of FPGA solutions especially in the fields of availability and time to market. Furthermore, we show a trend toward System On Chip (SoC) designs based on FPGA architectures. New FPGA features, such as on-chip processor cores, offer interesting new possibilities in this direction, but despite the increasing amount of very fast on-chip static RAM, we still notice a lack of sufficient dynamic on-chip memory. These insights were gained during the development of an FPGA implementation of a graphic controller. Our own emulation environment, called SPYDER-VIRTEX-X2E, was used for development.

Fast Reconfigurable Multiplier for FPGA

Sébastien Favard and Mohamed Shawky
firstname.lastname@utc.fr

In hardwired numerical computations, the multiplier is the most important basic operator in terms of delay and die area. Signal and image processing use convolution as the core of common operations such as filtering, edge detection, FFT, etc. Most current implementations of multipliers are very complex and do not take advantage of the main characteristic of multiply operations in signal processing: multiplying a constant by a variable. In this case, we may use a decomposition into adders, with a maximum of N-1 adders for an N-bit multiplier.
This count can be reduced using Booth encoding, saving N/2 adders. In this paper, we propose a multiplier-decomposition method that reduces the global complexity and hence both delay and die area. Compared with Booth-coding decomposition, the resulting mean number of adders decreases by 22% for a 16-bit multiplier and by 24% for a 20-bit multiplier.

Field Programmable Analog Array Modeling Approach For Fast Prototyping

S. Colancon, G. Cambon, L. Torres and C. Dufaza
Universite Montpellier II, LIRMM, UMR 5506 CNRS /UM II, 161 Rue ADA, 34392 Montpellier cedex 5, France
{colancon, cambon, torres, dufaza}@lirmm.fr

With the recent introduction of FPAA components, the concept of analog prototyping is now affordable for analog design. We propose the use of FPAAs in conjunction with CAD tools to implement a new analog prototyping environment. This environment must estimate and validate analog system functionality at different levels, using behavioral model simulation and/or hardware simulation results. In an analog prototyping flow, simulation and accurate models play an essential role. It is therefore necessary to have description languages that are easily readable, fast and accurate, to give a first idea of the analog behavior. The emergence of new analog description languages such as Verilog-A gives us what we need to describe a complete library of analog macro-functions. This library offers three modeling levels, which give us some granularity in analog simulation accuracy. Comparative results between simulation models and a hardware prototyping board validate the accuracy of our models and make it possible to share a complete analog system between these two levels in a new analog prototyping methodology.

An FPGA Architecture with Configurable Multiplier and Carry Units for Improved Arithmetic Performance

Kamal Rajagopalan and Peter Sutton
School of Computer Science and Electrical Engineering, The University of Queensland, Brisbane QLD 4072 Australia
raja@csee.uq.edu.au, p.sutton@csee.uq.edu.au

FPGAs are increasingly being applied to DSP applications but are often inefficient compared with dedicated DSP chips, particularly for multiplication-based operations. In this work, an FPGA architecture with Configurable Multiplier and Carry Units is proposed to improve arithmetic performance. Each logic block of the proposed architecture consists of flexible multiplier and carry units which can be configured to efficiently support multiplication, addition and multiply-accumulate operations in serial or parallel form. This is supported by dedicated inter-block interconnect. A programmable multiplexer allows the efficient implementation of barrel shifters, thus enhancing the efficiency of floating-point as well as fixed-point arithmetic operations. The proposed architecture is suitable for complex DSP applications and, when compared with the Xilinx XC4000, results in a 75% reduction in logic utilization.

FPGA Hardware Synthesis from MATLAB Utilizing Optimized IP Cores

Malay Haldar, Anshuman Nayak, Alok Choudhary and Prith Banerjee
http://www.MachDesignSystems.com

We present a MATLAB compiler that compiles algorithms described in MATLAB to an architecture comprising a processor core and a reconfigurable logic component. The processor core represents an embedded or DSP processor. The reconfigurable logic part represents an FPGA. The compilation process is particularly tuned towards the synthesis of signal/image processing and communication applications, as they benefit the most from our target architecture. Design reuse is particularly important for such applications, because some common functions appear in them very frequently.
Our compiler can automatically leverage highly optimized intellectual property (IP) cores for the frequently occurring functions. The IP cores are integrated into the synthesized designs in an automated fashion. Along with the hardware descriptions of the IP cores, executable code is present in the IP database. The executable code actively works with the compiler to generate arbitrarily complex interfaces and to supply accurate area and delay information for the various parameterizable IP cores.

Implementation of a VME Bus to Internal Bus Bridge FPGA Core

Xavier Revés, Antoni Gelonch, J. L. Garcia and Ferran Casadevall
Universitat Politècnica de Catalunya (UPC), Departament de Teoria del Senyal i Comunicacions, Jordi Girona 1-3 08034, Barcelona, Spain
{xreves,antoni}@xaloc.upc.es, garciam@teleline.es, feranc@tsc.upc.es

We present an FPGA core implementing the required bridging between the standard asynchronous VME bus and a simple synchronous Internal Bus. This VME-Internal Bus bridge core reduces the complexity of interfacing to a complete VME backplane, supporting all the functionality defined by the standard (data, interrupt and bus management). Its use of relatively few resources (about 1000 Logic Elements), its high level of performance, its ability to manage up to 80 Mbytes/sec on the VME side and 200 Mbytes/sec on the Internal Bus side, and its flexibility make this core especially interesting for VME systems with high transmission rates. Because the VME bus is a mature technology that is constantly updated and follows the state of the art in digital systems, it has wide acceptance and functionality in many industry-related environments. The use of an FPGA to implement this bridge allows the user to customize and release it depending on the application, thus obtaining a more compact solution.

The Machine CEPRA-S Configured for Stream Processing

Rolf Hoffmann, Bernd Ulmann, Klaus-Peter Völkmann and Stefan Waldschmidt
http://www.informatik.tu-darmstadt.de/MP

Stream processing is a very efficient execution model for processing large amounts of data in similar ways without the restrictions of traditional vector architectures. The concept of vector processing has been extended to obtain a more flexible and efficient execution model. In a stream processor, data and instruction streams are associated, allowing individual processing of data stream elements under the control of an instruction stream. A stream processor architecture has been designed which can be configured on the CEPRA-S, a configurable coprocessor consisting of two FPGAs and 10 memory banks, which allows efficient implementation of different logical architectures requiring a large memory bandwidth. The stream processing machine is programmed in a special-purpose language designed with the requirements of architectures such as vector computers or stream processors in mind. Operators in the arithmetic/logic unit of the stream processor can be configured to meet the requirements of special applications.

Motivation from a Full-Rate Specific Design to a DSP Core Approach for All GSM Vocoders

Shervin Sheidaei, Iran University of Science and Technology, Computer Engineering Department, sh_sh2@hotmail.com
Hamid Noori, AmirKabir University of Technology, Computer Engineering Department, hamid_noori@yahoo.com
Ahmad Akbari, Iran University of Science and Technology, Computer Engineering Department, akbari@iust.ac.ir
Hosein Pedram, AmirKabir University of Technology, Computer Engineering Department, pedram@ce.aku.ac.ir

GSM is a mobile telephony standard for cellular phones. GSM encodes a speech frame in less than 20 ms and carries it over a mixture of full-rate, half-rate and enhanced full-rate channels.
A dedicated architecture was designed for the full-rate vocoder; its implementation on an Altera FLEX10K FPGA uses 7100 logic cells, has a 59 ns clock period and takes 7.22 ms to encode a frame. To extend this architecture to half-rate and enhanced full-rate vocoders, the best solution was an application-oriented DSP core with lower power consumption than a general-purpose DSP that still preserves its flexibility. The core contains 11 functional units that operate in parallel and has a two-stage pipeline. The implementation on an Altera FLEX10KE FPGA uses 6838 LCs and 49152 memory bits (excluding ROMs), has a 27.5 ns clock period and takes 8.1 ms to encode a frame in half-rate systems. There are 129500 multiplications and 106500 additions in this frame.

Netlist Partitioning for Accelerated Verification Systems

Joachim Pistorius
Altera Corporation, 101 Innovation Drive - MS 2301, San Jose, CA 95134
joachim_pistorius@altera.com

Michel Minoux
Laboratoire d'Informatique de Paris 6, Université Pierre et Marie Curie, 4 place Jussieu, F-75252 Paris Cedex 05, France
michel.minoux@lip6.fr

Accelerated verification requires the implementation of functional design descriptions on hierarchically built hardware systems. Netlist partitioning is used with the aim of minimizing the hardware resource requirements and controlling the computation time while meeting size and pin constraints for each partition at each hierarchical level. Our partitioning strategy is based on an extensive study of basic algorithms. The best-suited algorithms were improved and, where necessary, adapted for partitioning with size and pin constraints. The various algorithms were then combined to form composite algorithms dedicated to each hierarchical level. These composite algorithms and their various possible combinations have been evaluated using very large industrial netlists and randomly generated benchmark netlists.
The experimental results show significant improvements over the global partitioning results of a state-of-the-art industrial logic emulator in terms of both the number of boards required for the design implementation and the computation time.

Proving Safety Properties of FPGAs

Adrian Hilton
Teleca, 88/89 High Street, Winchester, Hampshire, England
adi@suslik.org, http://www.suslik.org/

Jon Hall
The Open University, Walton Hall, Milton Keynes, England
j.g.hall@open.ac.uk, http://mcs.open.ac.uk/jgh23

FPGAs are increasing in complexity and are being used as important components of safety-critical systems. Emerging safety standards require analytic reasoning to demonstrate the safety of FPGAs in such systems. We describe a method which uses a synchronous process algebra to produce a formal proof that an FPGA program satisfies safety properties, and demonstrate its use in the specification of safety functions for a safety-critical system.

PuMA++: A Fully Automatic Path from Specification to Multi-FPGA-Prototype

Klaus Harbich
Inst. of Microelectronic Systems, University of Hannover, Appelstr. 4, D-30167 Hannover, Germany
harbich@ims.uni-hannover.de

Oliver Bringmann
Dept. of Computer Engineering, University of Tuebingen, Sand 13, D-72076 Tuebingen, Germany
bringmann@fzi.de

Erich Barke
Inst. of Microelectronic Systems, University of Hannover, Appelstr. 4, D-30167 Hannover, Germany
barke@ims.uni-hannover.de

In this poster we present a new design flow for the efficient hardware implementation of algorithmic-level behavioral system specifications on multi-FPGA (Field-Programmable Gate Array) rapid prototyping systems. We discuss the benefits of coupling the high-level synthesis tool CADDY-II with the partitioning and mapping environment PuMA, which is designed for the optimized implementation of RT-level (Register-Transfer) netlists on multi-FPGA architectures. Our new approach enables rapid prototyping and in-circuit verification in the earliest design phases.
Thanks to short implementation times and precise back annotation, achieved by a close coupling of the tools, more design iterations and thus better design space exploration are possible.

RCMAT: A Reconfigurable Coprocessor for Matrix Algorithms

A. Amira, A. Bouridane, and P. Milligan
School of Computer Science, The Queen's University of Belfast, Belfast BT7 1NN, Northern Ireland
{A.Abbes, A.Bouridane, P.Milligan}@qub.ac.uk

Recently, computer architectures which combine a reconfigurable coprocessor with a general-purpose microprocessor have been proposed as a solution for some computationally intensive tasks. These architectures are designed to exploit the large amount of fine-grain parallelism in such applications. Work is currently in progress at the Queen’s University of Belfast to develop a field programmable gate array based rapid prototyping environment for matrix algorithms, including matrix operations, matrix transforms and matrix decompositions. The aim of this paper is to describe the environment of the general-purpose RCMAT coprocessor. Key aspects of the architecture, together with a prototype software environment, are presented. Preliminary performance results and comparisons with similar algorithms implemented on different platforms show better performance for the RCMAT platform.

Reconfigurability in Embedded Microprocessors: A Prototyping Study

Sergej Sawitzki, Steffen Köhler and Rainer G. Spallek
Inst. of Computer Engineering, Dresden University of Technology, D-01062 Dresden, Germany
{sawitzki,stk,rgs}@ite.inf.tu-dresden.de
http://www.inf.tu-dresden.de/TU/Informatik/TeI/index_e.html

This work introduces a prototyping environment for reconfigurable microprocessor design targeting embedded systems.
It differs from previous approaches in that it discusses a systematic way (covering both the hardware and the software sides) to design, test and debug a class of reconfigurable computing cores rather than one particular application. Both static and dynamic reconfiguration approaches were tested during the execution of several algorithmic cores from the signal processing, image processing, data compression and simulation domains. First experiments with a simple 8-bit prototype, using a reconfigurable ALU extension to implement application-specific instructions, have shown that processor reconfiguration allows performance gains by a factor of 2 to 28 compared with the non-reconfigurable counterpart for these applications. The speedup factors depend on the number of executions of the algorithmic core and on the reconfiguration time. The study has revealed some directions for further architectural improvements.

Systems Prototyping Dedicated to Neural Network Real-Time Image Processing

Rolf F. Molz, Paulo M. Engel
Inst. de Informática - UFRGS - Caixa Postal, 15064 - POA - RS. CEP: 91501-970, Brazil
rolf@dinf.unisc.br, engel@inf.ufrgs.br

Fernando G. Moraes
Fac. de Informática - PUC - Av. Ipiranga, 6681-Prédio 30 / bloco 4 POA - RS. CEP: 90619-900, Brazil
moraes@inf.pucrs.br

Lionel Torres, Michel Robert
LIRMM - Université Montpellier II, 161, rue Ada, 34392 MONTPELLIER Cedex 5, France.
{torres,robert}@lirmm.fr

The configurable computing research community has been using different examples of image processing operations as computing exercises. Nevertheless, few research efforts have aimed at designing complete systems for vision applications, including both the hardware and the software elements required.
The main goal of this work is to show that the techniques and technologies used by the configurable computing community can be combined into a low-cost system capable of mapping a wide range of computer vision applications in research and industrial environments. This work proposes a portable system for image processing that uses reconfigurable devices (FPGA) and a signal processor (DSP) available in a flexible codesign platform (APTIX). This hardware/software implementation is a fully stand-alone system, able to execute all the tasks required for shape localization and classification. The system can be implemented in a dedicated ASIC, yielding a system-on-a-chip for image processing. To complete our system, we can connect it to a CMOS Image Sensor circuit in order to include the image acquisition task.

The Systolic Ring: A Reconfigurable Systolic Architecture

Gille Sassatelli, G. Cambon, Jérome Galy and Lionel Torres
Université Montpellier II, LIRMM, UMR 5506 CNRS /UM II, 161 Rue ADA, 34392 Montpellier Cedex 5, France

The Internet is becoming one of the key features of tomorrow’s communication world. The evolution of mobile phone networks, such as UMTS, will soon allow everyone to be connected, everywhere. These new network technologies bring the ability to deal not only with classical voice or text messages, but also with richer content: multimedia. At the mobile level, this kind of data-oriented content requires highly efficient architectures, and today’s mobile system-on-chip solutions will no longer be able to manage critical constraints such as area, power, and data computing efficiency. We show why classical FPGA architectures are no better suited to these arithmetic-oriented applications, and accordingly propose a new coarse-grain, dynamically reconfigurable network dedicated to data-oriented applications such as those targeted at third-generation networks.
Principles, realizations and comparative results are presented for some classical applications targeted at different architectures.

Task Partitioning Between a General Purpose Microprocessor and Reconfigurable Hardware

Nitij Mangal
nVIDIA Corporation, Inc.
nmangal@nvidia.com

Puneet Gupta
Mindtree Technologies Pvt. Ltd.
puneet_gupta@mindtree.com

C. P. Ravikumar
ravikumar@controlnet.co.in

We describe a hardware-software codesign solution for a target architecture consisting of a general-purpose processor and a reconfigurable coprocessor. We consider two different working models for the coprocessor: one in which the outputs of various prospective coprocessor configurations are multiplexed, and a second in which the coprocessor supports “need based” reconfiguration via an on-chip configuration cache. The input to the system is a C program, whose call graph is derived and modified so that every instance of a function appears as a separate node. Recursive functions are required to be implemented in software. Similarly, the successor nodes of all hardware-implemented functions are also implemented in hardware, so that hardware-software communication interfaces are unnecessary. We estimate task execution times in software and hardware, as well as the software-hardware communication overhead. We experimented with both simulated annealing and genetic algorithms for hardware-software partitioning. Hardware-implemented functions are pushed up in the reconfiguration order to minimize the number of reconfigurations. We were able to obtain speedups of over 1.25 for an ADPCM example in significantly less time than an exhaustive search method.

Two-Dimensional 8x8 Fast Cosine Transform Parallel Processor

Dr.
Anatoly Melnyk, Yury Ermetov and Bohdan Dunets
Lviv Polytechnic National University, 79013, Bandera str., 12, Lviv, Ukraine
{aomelnyk,yoerm,dunets}@polynet.lviv.ua, http://www.polynet.lviv.ua

The discrete cosine transform (DCT) algorithm is the basis of such widespread coding standards as JPEG, MPEG, H.261, etc. Two-dimensional DCT computation uses the row-column approach, consisting of an 8x8 data transposition, an 8-point DCT, another 8x8 data transposition and another 8-point DCT. For the 8-point DCT calculation, a new parallel fast cosine transform (FCT) algorithm was developed, which replaces the sequential additions of the usual FCT algorithm with the same number of parallel additions. To implement the FCT, a fully parallel 8-channel pipelined FCT unit was designed. The development of specialized constant multipliers made it possible to obtain maximum performance with a minimum amount of hardware. Data transposition was implemented with a specialized sorting memory that uses registers instead of RAM to obtain maximum speed. A VHDL model of the processor was created and implemented on a Xilinx FPGA using Synplify, the FPGA and CPLD synthesis tool from Synplicity, Inc. Compared with an existing parallel 8-channel implementation of the 2D 8x8 DCT from Xilinx, Inc., the parallel processor proposed in this paper requires 15% less hardware and is 65% faster.

A Universal Fault-Tolerant Methodology in SRAM-Based FPGA Systems

Yanmei Li, Dongmei Li and Zhihua Wang
Department of Electronic Engineering, Tsinghua University, Beijing, 100084, P. R. China
{liym, lidm, wangzh}@hannah.ee.tsinghua.edu.cn

Compared with antifuse-based FPGAs (Field Programmable Gate Arrays), SRAM-based FPGAs are more attractive because they offer reprogrammability and more flexibility. However, in some applications the SRAMs face serious threats; for example, they are susceptible to radiation-induced upsets in aerospace systems.
Faults in the configuration SRAMs may cause a functional failure in the FPGA and even in the whole system. As an effective solution, a TMR (Triple Modular Redundancy) system and a universal fault-tolerant algorithm are presented in this paper. Fault identification, mitigation and correction are introduced. Moreover, key circuit designs are also described, including two sorts of voter circuits, the output compaction and the scan chain design. For general SRAM-based FPGAs, this fault-tolerant methodology can mitigate the effects of SRAM faults and detect these faults without interrupting system operation. Its effectiveness is demonstrated through simulation and experiment based on Xilinx FPGAs.
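
The voting step at the heart of a TMR system can be sketched in a few lines. The abstract above does not detail its two voter circuits, so the Python below is only a generic bitwise 2-of-3 majority voter with a disagreement flag, given as an illustration under that assumption. The names tmr_vote and locate_fault are hypothetical and do not come from the paper.

```python
from typing import Optional, Tuple


def tmr_vote(a: int, b: int, c: int) -> Tuple[int, int]:
    """Bitwise 2-of-3 majority vote over three redundant module outputs.

    Returns (voted, disagree). Each bit of `voted` takes the value held by
    at least two of the three inputs. Each set bit of `disagree` marks a
    position where the three inputs are not unanimous.
    """
    voted = (a & b) | (b & c) | (a & c)
    disagree = (a ^ b) | (b ^ c)
    return voted, disagree


def locate_fault(a: int, b: int, c: int) -> Optional[int]:
    """Identify which module (0, 1 or 2) deviates from the voted word.

    Returns None when all modules agree, the module index when exactly one
    deviates, and -1 when more than one deviates (illustrative convention).
    """
    voted, disagree = tmr_vote(a, b, c)
    if disagree == 0:
        return None
    suspects = [i for i, word in enumerate((a, b, c)) if word != voted]
    return suspects[0] if len(suspects) == 1 else -1


if __name__ == "__main__":
    # Module 2 has one flipped bit; the vote masks it and flags the position.
    voted, disagree = tmr_vote(0b1010, 0b1010, 0b1110)
    print(bin(voted), bin(disagree), locate_fault(0b1010, 0b1010, 0b1110))
```

Because the vote is taken per bit, a single upset in one module is masked at the output while the disagreement word still reports it, which is the kind of detection without interrupting operation that the abstract describes.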