ECE 506 Reconfigurable Computing
http://www.ece.arizona.edu/~ece506
Lecture 3: Reconfigurable Architectures
Ali Akoglu

Complex Programmable Logic Device
° Hierarchical design to counter the size explosion of PLAs
  • Combinational logic with flip-flops (registered output)
  • Organized into logic blocks connected by an interconnect matrix
  • Usually enough logic for simple counters, state machines, decoders, etc.

Xilinx CoolRunner II CPLD
° PLA and macrocell combination
° 1.8 V device, estimated current consumption of less than 100 µA
° Up to 12,000 gates, 512 macrocells

CPLD
° Multiple Function Blocks (FBs) and I/O Blocks (IOBs)
  • Fully interconnected (FB outputs and input signals feed the FB inputs)
  • Each FB provides programmable logic with 54 inputs and 18 outputs.
° The IOB provides buffering for device inputs and outputs.
° Output enable signals drive directly to the IOBs.

Function Block
° Comprises 18 independent macrocells
  • Each can implement a combinatorial or registered function.
° Logic within the FB is implemented using a sum-of-products representation.
  • Fifty-four inputs (108 true and complement signals) feed the programmable AND-array to form 90 product terms.
  • Any number of these product terms can be allocated to each macrocell by the product term allocator.
° How many product terms would you assign to each macrocell?

Macrocell
° Five direct product terms are available to each macrocell.
° The product term allocator selects how they are used:
  • as primary data inputs to the OR gate for combinatorial functions
  • as control inputs (clock, clock enable, set, reset, output enable)
° Each macrocell can be configured for a combinatorial or registered function.

Product Term Allocator
° Controls how the five direct product terms are assigned to each macrocell (MC).
  • For example, all five direct terms can drive the OR function.

Product Term Allocator
° Can re-assign other product terms within the FB to increase the logic capacity of a macrocell beyond five direct terms.
° Any macrocell requiring additional product terms can access uncommitted product terms in other macrocells within the FB.
° Up to 15 product terms can be available to a single macrocell with only a small incremental delay (tPTA).

Product Term Allocator
° Can re-assign product terms from any macrocell within the FB by combining partial sums of products over several macrocells
  • What is the incremental delay in this example? 2 * tPTA
° If all 90 product terms are available to any macrocell, what is the maximum incremental delay?

Programmability Options
° PLDs and CPLDs have different types of programmability.
  • initial programming and reprogramming
° One-time programmable:
  • device is programmed once and holds its programming "forever"
  • usually uses fuses to make/break links
  • not reusable, but usually the cheapest
  • discard the device if changes are to be made

Programmability Options
° UV-Erasable (EPROM)
  • a floating gate is positioned between the control gate of a regular MOS transistor and the channel.
  • initially, the floating gate is uncharged
° To program the cell:
  • a high voltage (e.g. 14 volts) is applied to the control gate (drain is at ~12 volts).
  • this causes current to flow between the source and drain.
  • electrons are accelerated to high velocity, and a small fraction of them traverse the thin oxide and become trapped on the floating gate.
  • the floating gate, surrounded by an insulating layer, becomes "permanently" negatively charged and the transistor is permanently turned off.
° "Permanent" means about 10 years at 125 degrees C; at higher temperatures this time is reduced.
° Cells are erased by ultraviolet (UV) light.
  • electrons on the floating gates are excited and discharged to the substrate.

Programmability Options
° Electrically Erasable (EEPROM)
  • uses a floating gate structure with a control gate on top.
  • both erasing and reprogramming are accomplished electrically
  • device can be programmed/erased on the circuit board; no special packaging or IC socket is needed
  • erase time is much faster than UV erase
  • programming is retained after power down (non-volatile)
  • programming/erasing is limited to 1000s of cycles

Electrically Erasable PLDs
° Conventional PLDs are either
  • One-time programmable
  • UV erasable
° They must be placed in a programmer to program them
° EE PLDs can be programmed and erased in place
  • A small (four-wire) connection to a computer is needed
  • Once programmed, the device retains its program indefinitely
  • Never have to take the chip out of its circuit

FPGA
° Introduced in 1985 by Xilinx
° Similar to CPLDs
° A function to be implemented in an FPGA is
  • Partitioned into modules, each implemented in a logic block.
  • Realized by connecting the logic blocks through the programmable interconnect.

FPGA Technology
° 1) Antifuse-based
  • Realization of interconnections
° 2) Memory-based
  • Realization of interconnections and computation
  • Flash, SRAM

FPGA Technology
° Antifuse FPGAs:
  • configured by burning a set of antifuses.
  • once configured, cannot be altered
  • bug fixes and updates are possible for new PCBs, but hardly for already manufactured boards.
  • ASIC replacement for small volumes.
° Flash FPGAs
  • may be re-programmed several thousand times and are non-volatile
  • expensive; re-configuration takes several seconds
° SRAM FPGAs
  • dominant technology
  • unlimited re-programming
  • additional circuitry is required to load the configuration into the FPGA after power on
  • re-configuration is very fast
  • some devices even allow partial re-configuration during operation

Antifuse (Actel FPGA)
° An antifuse is normally an open circuit.
° Two-terminal element: the terminals connect to the upper and lower layers of the antifuse, with a dielectric (oxide-nitride-oxide, ONO) layer in the middle
° Initial state:
  • The high resistance of the dielectric does not allow any current to flow.
° Applying a high voltage:
  • causes large power dissipation and melts the dielectric
  • drastically reduces the resistance
  • builds a link that permanently connects the two layers.

Antifuse chips
° Advantages
  • Small area
    - With metal-to-metal antifuses, no silicon area is required to make connections, decreasing the area overhead of programmability.
  • Much lower resistance and parasitic capacitance than transistors.
    - possible to include more switches per device
    - reduces the RC delays in the routing.
  • No bitstream can be intercepted in the field (no bitstream transfer)
    - A scanning electron microscope is needed to try to read the antifuse states (an Actel AX2000 antifuse FPGA contains 53 million antifuses, with only 2-5% programmed in an average design)
  • Interconnect structure is naturally "rad hard"
    - relatively immune to the effects of radiation (except flip-flops!)
    - an SRAM-based component can be "flipped" if hit by radiation

Antifuse chips
° Disadvantages
  • not suitable for devices that must be frequently reprogrammed
  • one-time programmable FPGAs.
  • special programmers must be used to program a device before it is mounted on a final product
  • programming involves significant changes to the properties of the materials in the fuse
    - leads to scaling challenges when new IC fabrication processes are considered

Programmability Options
° Static Random Access Memory (SRAM) Programming:
  • Switch is a pass transistor controlled by the state of the SRAM bit
  • Logic block configuration bits are stored in SRAM
  • can be reprogrammed an unlimited number of times
  • uses standard CMOS process technology
    - SRAM cells are created using exactly the same CMOS technologies as the rest of the device.
    - No special processing steps are required to create these components.
    - Benefits from the increased integration, higher speeds and lower dynamic power consumption of new processes with smaller minimum geometries.

Programmability Options
° SRAM Volatility
  • programming contents are NOT retained after power down
  • an external non-volatile memory device is required on power up
° SRAM Size
  • An SRAM cell requires either 5 or 6 transistors, and the programmable element used to interconnect signals requires at least a single transistor.
° SRAM Security
  • Since the configuration information must be loaded into the device at power up, it could be intercepted and stolen for use in a competing system.

Programmability Options
° Flash Programming:
  • an alternative that addresses some of the shortcomings of SRAM
° Use of floating gate programming technologies
  • inject charge onto a gate that "floats" above the transistor.
° Non-volatile
  • eliminates the need for external storage for configuration data
  • can function immediately upon power-up
° Area efficiency
  • Area overhead: the programming circuitry (high and low voltage buffers) needed to program the cell
  • The cost is relatively modest, as it is amortized across numerous programmable elements.
Programmability Options
° Flash cannot be reprogrammed an infinite number of times.
  • Charge buildup in the oxide eventually prevents a flash-based device from being properly erased and programmed
° Non-standard CMOS process.
  • around five additional process steps on top of standard CMOS
  • lags SRAM-based devices by one or more process generations.
° Programming time is about three times that of an SRAM-based component.
° High resistance and capacitance due to the use of transistor-based switches.
° Solution: on-chip flash memory provides non-volatile storage, with SRAM cells controlling the programmable elements in the design.

Programmability Options
° An ideal technology would be
  • non-volatile
  • reprogrammable
  • built using a standard CMOS process
  • low in on-resistance and parasitic capacitance.

FPGA Components
° How can we implement any circuit in an FPGA?
  • Example: Half adder
    - Combinational logic represented by a truth table
° What kind of hardware can implement a truth table?

    Input | Out      Input | Out
    A  B  |  S       A  B  |  C
    0  0  |  0       0  0  |  0
    0  1  |  1       0  1  |  0
    1  0  |  1       1  0  |  0
    1  1  |  0       1  1  |  1

FPGA Components
° Lookup Table (LUT)
° Implement the truth table in small memories (LUTs)
  • Usually SRAM
° A function is implemented by writing all possible values that the function can take into the LUT
° The input values are used to address the LUT and retrieve the value of the function corresponding to those inputs

    [Figure: two 2-input, 1-output LUTs addressed by A and B]
    Addr | S       Addr | C
     00  | 0        00  | 0
     01  | 1        01  | 0
     10  | 1        10  | 0
     11  | 0        11  | 1

FPGA Components
° Alternatively, we could have used a 2-input, 2-output LUT
  • Outputs commonly use the same inputs

    [Figure: one 2-input, 2-output LUT addressed by A and B]
    Addr | S  C
     00  | 0  0
     01  | 1  0
     10  | 1  0
     11  | 0  1

FPGA Components
° Slightly bigger example: Full adder
  • Combinational logic can be implemented in a LUT with the same number of inputs and outputs
    - 3-input, 2-output LUT

    Truth Table (contents of the 3-input, 2-output LUT)
    Inputs     | Outputs
    A  B  Cin  | S  Cout
    0  0  0    | 0  0
    0  0  1    | 1  0
    0  1  0    | 1  0
    0  1  1    | 0  1
    1  0  0    | 1  0
    1  0  1    | 0  1
    1  1  0    | 0  1
    1  1  1    | 1  1

FPGA Components
° LUT Example: Implement the function ABD+BCD+ABC using
  • 2-input LUTs
  • 3-input LUTs
  • 4-input LUTs

FPGA Components
° LUTs are used as function generators
° How many SRAM locations does a k-input LUT have? 2^k
° How many different functions can a k-input LUT implement? 2^(2^k)

FPGA Components
° Why aren't FPGAs just a big LUT?
° Size of truth table
  • grows exponentially with the number of inputs
  • 3 inputs = 8 rows, 4 inputs = 16 rows, 5 inputs = 32 rows, etc.
  • Same number of rows in truth table and LUT
  • LUTs therefore grow exponentially with the number of inputs
° Number of SRAM bits in a LUT
  • = 2^i * o
  • i = # of inputs, o = # of outputs
  • Example: 64-input combinational logic with 1 output would require 2^64 SRAM bits
    - about 1.84 x 10^19
° Clearly, it is not feasible to use large LUTs
  • So, how do FPGAs implement logic with many inputs?

FPGA Components
° Fortunately, we can map circuits onto multiple LUTs
  • Divide the circuit into smaller circuits that fit in LUTs (same # of inputs and outputs)
  • Example: 3-input, 2-output LUTs

FPGA Components
° Large LUTs
  • Fast when using all inputs
  • Waste transistors otherwise
° Must also consider total chip area
  • Wasting transistors may be ok if there are plenty of LUTs

FPGA Components
° What if the circuit doesn't map perfectly?
  • More inputs in LUT than in circuit
    - Truth table handles this problem
  • More outputs in LUT than in circuit
    - Extra outputs are simply not used
    - Space is wasted, so use multiple outputs whenever possible
° Important Point
  • The number of gates in a circuit has no effect on the mapping into a LUT
    - All that matters is the number of inputs and outputs
    - Unfortunately, it isn't common to see large circuits with only a few inputs

    [Figure: two circuits, one with 1 gate and one with 1,000,000 gates]

FPGA Components
° LUT-Realization
° A LUT is basically a multiplexer that evaluates the truth table stored in the configuration SRAM cells (it can be seen as a one-bit-wide ROM).
° QUIZ 2

FPGA Components
° Example:
  • Determine the best LUTs for the following circuit
    - Choices
      - 4-input, 2-output LUT (delay = 2 ns)
      - 6-input, 2-output LUT (delay = 3 ns)
    - Assume each SRAM cell is 6 transistors
      - 4-input LUT = 6 * 2^4 * 2 = 192 transistors
      - 6-input LUT = 6 * 2^6 * 2 = 768 transistors (listed as 384, which equals 6 * 2^5 * 2, the size of a 5-input, 2-output LUT)
  • Using the 6-input LUT
    - Propagation delay = 3 ns
    - Total transistors = 384 (as listed)
  • Using 4-input LUTs
    - Propagation delay = 2 * 2 ns = 4 ns
    - Total transistors = 2 * 192 = 384
  • With the listed figures, 6-input LUTs are 1.3x faster and use the same area

FPGA Components
° Problem: How to handle sequential logic?
  • Truth tables don't work
° Possible solution:
  • Add a flip-flop to the output of the LUT
° BLE: the basic logic element
  • The circuit can now use the output from the LUT or from the FF
  • Where does the select come from?
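
The LUT and BLE ideas from the last few slides can be sketched in a few lines of Python. This is only an illustrative model, not how any particular FPGA is built: the class names LUT and BLE, the use_ff flag, and the choice of the first input as the most significant address bit are all assumptions made for the example. The select of the output multiplexer is modelled as one more configuration bit, which is one common answer to the question on the slide.

```python
from itertools import product


class LUT:
    """k-input, 1-output lookup table: 2**k stored bits addressed by the inputs."""

    def __init__(self, k, func):
        self.k = k
        # "Program" the LUT by writing the function value for every input combination.
        self.bits = [int(bool(func(*combo))) for combo in product((0, 1), repeat=k)]

    def read(self, *inputs):
        # Build the address from the inputs (first input = most significant bit).
        addr = 0
        for bit in inputs:
            addr = (addr << 1) | (bit & 1)
        return self.bits[addr]


class BLE:
    """Hypothetical basic logic element: a LUT, a D flip-flop, and a 2:1 output mux."""

    def __init__(self, lut, use_ff):
        self.lut = lut
        self.use_ff = use_ff  # mux select, assumed to be a configuration bit
        self.ff = 0           # flip-flop state

    def clock(self, *inputs):
        # On a clock edge the flip-flop captures the current LUT output.
        self.ff = self.lut.read(*inputs)

    def output(self, *inputs):
        # The mux picks the registered value or the combinational LUT output.
        return self.ff if self.use_ff else self.lut.read(*inputs)


# Half adder from the truth tables: S = A xor B, C = A and B.
sum_lut = LUT(2, lambda a, b: a ^ b)    # stored bits: [0, 1, 1, 0]
carry_lut = LUT(2, lambda a, b: a & b)  # stored bits: [0, 0, 0, 1]

for a, b in product((0, 1), repeat=2):
    print(a, b, sum_lut.read(a, b), carry_lut.read(a, b))
```

A k-input LUT in this model always stores 2**k bits regardless of how many gates the original function contained, which is the "important point" made above. Setting use_ff=True turns the same LUT into a registered function whose output changes only when clock() is called.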