IMPROVING DESIGN OBSERVABILITY AND CONTROLLABILITY FOR FUNCTIONAL VERIFICATION OF FPGA-BASED CIRCUITS USING DESIGN-LEVEL SCAN TECHNIQUES

by Timothy Brian Wheeler

A thesis submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of Master of Science

Department of Electrical and Computer Engineering
Brigham Young University
February 2001

Copyright © 2001 Timothy Brian Wheeler
All Rights Reserved

BRIGHAM YOUNG UNIVERSITY

GRADUATE COMMITTEE APPROVAL of a thesis submitted by Timothy Brian Wheeler

This thesis has been read by each member of the following graduate committee and by majority vote has been found to be satisfactory.

Date    Brent E. Nelson, Chair
Date    Brad L. Hutchings
Date    Michael J. Wirthlin

BRIGHAM YOUNG UNIVERSITY

As chair of the candidate’s graduate committee, I have read the thesis of Timothy Brian Wheeler in its final form and have found that (1) its format, citations, and bibliographical style are consistent and acceptable and fulfill university and department style requirements; (2) its illustrative materials including figures, tables, and charts are in place; and (3) the final manuscript is satisfactory to the graduate committee and is ready for submission to the university library.

Date    Brent E. Nelson, Chair, Graduate Committee
Accepted for the Department    A. Lee Swindlehurst, Graduate Coordinator
Accepted for the College    Douglas M. Chabries, Dean, College of Engineering and Technology

ABSTRACT

IMPROVING DESIGN OBSERVABILITY AND CONTROLLABILITY FOR FUNCTIONAL VERIFICATION OF FPGA-BASED CIRCUITS USING DESIGN-LEVEL SCAN TECHNIQUES

Timothy Brian Wheeler
Department of Electrical and Computer Engineering
Master of Science

FPGA devices have become an increasingly popular way to implement hardware due to their flexibility and fast time to market. However, one of their major drawbacks is their limited ability to provide complete functional verification of user designs. Readback, partial reconfiguration, and built-in logic analyzers are all common examples of debug features currently available in many FPGAs, but none of them provide complete observability and controllability of the user design. To take advantage of the full potential of FPGA systems, FPGA development tools need to provide this level of observability and controllability to enable designers to quickly find and remove bugs from their circuits. This work describes the use of design-level scan to provide complete observability and controllability for functional verification of FPGA-based designs. An overview of current debug methods is given. A detailed description of implementing design-level scan is presented. The costs associated with scan are provided, together with strategies on how to reduce those costs. This work will show that design-level scan is a viable option for overcoming the limitations of current functional verification techniques.

ACKNOWLEDGMENTS

I want to express sincere thanks to my early engineering professors, including Dr. Lee Swindlehurst, Dr. Rich Selfridge, Dr. Brent Nelson, and Dr. Doran Wilde, for showing me the error of my ways as a computer science major and helping me make the transition into the fluorescent light of computer engineering. I would like to thank Dr. Brent Nelson and Paul Graham for providing me with a research topic for my thesis. I would like to thank the back row for being–well, the back row.
In particular, Steve Morrison for his eager desire to plan the lab parties, Major Greg Ahlquist for being the highest ranking officer in the lab, Brett Williams for having a wife who makes awesome poppyseed cake, Russell Fredrickson for sharing his paste with me when I got hungry in nursery, Justin Tripp for giving life to Slaacenstein, Aaron Stewart for turning his desk drawer into a kitchen, and Preston Jackson for his extreme courage (or is it endurance?) in doing analog stuff, on a Mac, next to two smelly refrigerators. I would like to thank the department secretaries for providing us with popcorn during the devotionals and for always being so awesome. I would like to thank my friends who visited me and brought me food after spending many of my waking hours (and a few of my sleeping ones) cooped up in a lab without any windows. I would like to thank Bugzilla for making sure my Inbox was never empty. I would like to thank my old textbooks for providing me with a cool monitor stand. Finally, I would like to thank the air conditioning here in the back row for always being such a blast and for forcing me to go home early when my fingers became too numb to type.

Contents

Acknowledgments  vii
List of Tables  xi
List of Figures  xiv
1 Introduction  1
  1.1 Introduction  1
2 Current State of Debug  5
  2.1 Mechanisms to Increase Controllability and Observability in FPGA Designs  5
    2.1.1 Ad Hoc Methods  5
    2.1.2 Structured Methods  6
    2.1.3 Configuration Bitstream Readback  8
    2.1.4 Design-Level Scan  9
    2.1.5 Partial Reconfiguration  11
    2.1.6 Summary  11
3 Implementation of Scan Chain  13
  3.1 Design-Level Scan Implementation  13
    3.1.1 Instrumenting Design Primitives  14
    3.1.2 Storing the Scan Bitstream  24
    3.1.3 Instrumenting The Design Hierarchy  24
    3.1.4 Optimizing Scan  25
4 Costs of Scan Chain  27
  4.1 The Costs of Design-Level Scan  27
    4.1.1 Scan for Library Modules  27
    4.1.2 Partial Scan  36
    4.1.3 Scan for Large Designs  39
    4.1.4 Packing Scan Logic into Existing LUTs  45
    4.1.5 Using Dedicated Scan Multiplexors  45
    4.1.6 Other Cost Issues  48
5 Variations of and Alternatives to Scan  49
  5.1 Supplementing Existing Observability and Controllability  49
    5.1.1 Strategies to Increase Controllability  50
    5.1.2 Strategies to Increase Observability  52
  5.2 Summary of Results  55
6 Other Scan Issues  59
  6.1 Overview  59
    6.1.1 FPGA System-Level Issues  59
    6.1.2 Scan Overhead in FPGAs vs. VLSI  60
    6.1.3 Stopping the Global Clock  61
7 Conclusions  63
  7.1 Conclusions and Future Work  63
Bibliography  66

List of Tables

4.1 Design-Level Scan Costs for a Few Modules without Optimizations  29
4.2 Design-Level Scan Costs for a Few Modules with Optimizations  29
4.3 Design-Level Scan Costs for a Few Modules—LUT vs. LE Costs  35
4.4 Design-Level Scan Costs for a Few Modules—Full vs. Partial Scan  38
4.5 Area and Speed of Sample Designs  39
4.6 Design-Level Scan Costs for Sample Designs  39
4.7 Logic Packing for Sample Designs  46
4.8 Design-Level Scan Costs Using a Dedicated Scan Mux  47
5.1 Area of Sample Designs  51
5.2 Design-Level Scan on Only Flip-Flops  51
5.3 Cost of Repairing BlockRAM Readback  54
5.4 Original Area of User Designs  56
5.5 Area of User Designs w/ Full-Scan  56
5.6 Best-Case Results for Improving Observability  57
5.7 Best-Case Results for Improving Controllability  57
5.8 Best-Case Results for Improving Observability and Controllability  57
6.1 Area Overheads for Clock-Stopping Circuitry  62

List of Figures

3.1 Circuit View when ScanEnable is Deasserted  13
3.2 Circuit View when ScanEnable is Asserted  13
3.3 Instrumenting a Flip-Flop for Scan  14
3.4 Instrumenting Multiple Flip-Flops for Scan  15
3.5 Embedded RAMs Linked in a Scan Chain  16
3.6 Multi-Bit ARSW RAM Instrumented for Scan  17
3.7 Address Generator for RAM Instrumentation  18
3.8 16-deep SRL Instrumented for Scan  19
3.9 Sample Circuit Containing Synchronous RAM  20
3.10 First Attempt At Scan-Out for Sample Circuit  21
3.11 Corrected Scan-Out Operation  21
3.12 Scan-In for Sample Circuit  21
3.13 Synchronous RAM Instrumented for Scan  23
3.14 Shadow Output Register for Synchronous RAM  23
4.1 A 4-Bit Up-Counter  28
4.2 A 4-Bit Up-Counter Instrumented for Scan  29
4.3 Conceptual View of a 16x16 Array Multiplier  30
4.4 A Single Multiplier Cell  30
4.5 A Single Multiplier Cell Instrumented for Scan  31
4.6 A Fully-Pipelined Rotational CORDIC Unit  32
4.7 One CORDIC Stage  32
4.8 A Fully-Pipelined Rotational CORDIC Unit Instrumented for Scan  33
4.9 Folding Scan Logic Into Existing 4-LUTs  34
4.10 Disabling Unscanned FFs for Partial Scan  37
4.11 Conceptual View of Partial Scan for the Array Multiplier  37
4.12 LUT Scan Overhead for Instrumenting a Single Virtex BlockRAM  41
4.13 Flip-Flop Scan Overhead for Instrumenting a Single Virtex BlockRAM  41
5.1 Logic to Fix Readback Problem with Virtex BlockRAMs  53
6.1 Global Clock Enable Circuitry for Flip-Flops  61
6.2 Global Clock Enable Circuitry for Flip-Flops without Clock Enables  61

Chapter 1 Introduction

1.1 Introduction

FPGA devices are a popular way to implement hardware in many products. Their reprogrammability gives them a great advantage over conventional ASICs in that it reduces the overall design risk and time to market. For example, when designing an ASIC, most of the work is done on the front end; that is, all of the functional design and verification, timing analysis, etc. are done extensively in software before the design ever reaches silicon. The simulation process in software is slow and meticulous, and the design engineer must go to great lengths to ensure that the design will work correctly and operate at the desired speed when implemented in actual silicon. Once the ASIC design has been extensively simulated and debugged in software, the process of fabricating the ASIC and doing further debug and verification of the silicon and hardware circuit is yet another lengthy and costly step. With FPGAs, however, the hardware is available from the outset; thus, the hardware implementation takes place much sooner, execution in hardware is much faster than simulation in software, corner cases can be tested more easily, and bugs can be found and removed much more quickly. Their reprogrammability also eliminates the lengthy fabrication step associated with ASICs since designers develop applications directly onto the FPGA hardware. Although ASICs may be more favorable than FPGAs for extremely large and complex designs, such as a computer processor, where FPGAs simply cannot provide the necessary speed and logic required for the design, FPGA-based systems provide invaluable solutions to a wide range of circuit designs.

To take advantage of the full potential of FPGA systems, FPGA development tools ought to provide the same level of visibility and controllability as a software simulator to enable designers to quickly perform functional verification of the design. This includes enabling the designer to view and modify the state of the circuit at any given clock cycle. In addition, this capability should be provided automatically, without having to generate new configuration bitstreams every time a designer decides to view different signals. Unfortunately, current FPGA systems and software fall far short of this standard. For instance, visibility is limited or difficult to achieve in all of the current commercially available devices. For some FPGAs the only way to view the circuit state is to route individual internal signals to package pads where they can be accessed by external hardware, such as a logic analyzer. This approach limits the circuit visibility to only those signals that are routed to the package pads.
It also requires additional time-consuming runs of the vendor’s place and route software each time the user decides to probe a new signal, which in turn requires the designer to keep track of the many configuration bitstream files associated with the various analyzed signals. Some approaches require the designer to modify the original design description to include embedded logic analysis circuitry, which can lead to unintended modifications and errors. Other FPGAs provide fixed circuitry to serially read out much of the internal state without the need for added user circuitry; however, the entire internal state of the user circuit is not always available, the state sampling mechanism actually modifies the circuit state in some cases, and these techniques usually require the user to stop the external clock, something that experience has shown is often not supported by commercially-available FPGA systems. In addition to providing only limited visibility, none of the vendor-supplied approaches allow the designer to easily modify the current values of flip-flops during operation, which is something very commonly done when debugging a circuit. In general, because of these shortcomings, functional verification of FPGA-based designs is largely an ad hoc process that is much more difficult than it needs to be, and it does not exploit the full design power that could otherwise be taken advantage of in FPGAs.

Several FPGA vendors have provided additional tools in an attempt to increase the visibility of the circuit state during debug. Some examples include Xilinx’s ChipScope and Altera’s SignalTap features. Both of these tools provide real-time access to any node on the chip, allowing the design to run at full speed during debug. They even allow some of the trigger signals to change without re-running the design through the place and route tools. The downside to this approach, however, is that all of the signals to be traced must be declared up front, before the design is run through the place and route tools. The circuit visibility is limited to only these signals, and adding any more signals to this list requires more time-consuming passes through the place and route software. In addition to providing only limited visibility, tools like ChipScope and SignalTap provide no state-modifying capabilities at all. While such tools can be useful for debugging small portions of a large chip or when performing timing analysis of the chip as it runs at full speed, their ability to provide complete functional verification is very limited.

The purpose of this work is to present a systematic approach for providing complete observability and controllability for functional verification of FPGA-based designs. The basic strategy is to use an approach similar to design-level scan—implemented with user circuitry—to provide the user with complete visibility of the circuit state, as well as the ability to load the internal state with known values from an external source during the verification process. This allows the user complete control to view and modify state “variables”, just like in a software debugger. Because the approach is systematic, it can be automated. Using it does not require the modification of the original design specification, thus eliminating a source of error. This approach also allows users to view any part of the circuit state (once the state of the synchronous circuit elements is known, the values for any combinational portions of the circuit are easy to infer) without additional runs through the vendor’s place and route software.
In addition, since the clock is free running, the circuit can run at full speed until the user is ready to take a snapshot of the circuit state. Finally, because the approach is based on user circuitry, it can be made to work with just about any programmable device. The main downside to this approach, however, is the large area and speed overhead incurred when instrumenting a user design with scan. Fortunately, due to the reprogrammable nature of FPGAs, this overhead penalty is only temporary, as the FPGA can be reprogrammed with the original design without the scan chain once the verification process is complete.

Chapter 2 Current State of Debug

2.1 Mechanisms to Increase Controllability and Observability in FPGA Designs

Various methods exist to enhance observability and controllability in FPGA-based designs. Like those described in [1], they can be grouped into two categories: ad hoc methods and structured methods.

2.1.1 Ad Hoc Methods

Many different ad hoc methods can be used to debug user circuitry. One such method involves multiplexing internal signals onto external debug ports. These signals can then be viewed using a logic analyzer to help debug the circuit. Other methods include forms of self-test such as signature analysis. This technique uses Cyclic Redundancy Checking (CRC) to verify the results of various test sequences. Unfortunately, these two methods are painful to use and provide limited or no visibility into the actual state of the user circuit.

A common approach is for individual FPGA vendors to provide their own built-in logic analyzers, such as Xilinx’s ChipScope and Altera’s SignalTap. Both of these allow real-time access to any node on the chip, as well as the real-time ability to change the trigger conditions. Although the number of signals that can be accessed at any given time is greater than for an external logic analyzer, that number is still limited and the signals must be declared up front, before the design is run through the place and route tools. For Altera, adding new signals to view or making any other modifications to the embedded state analyzer other than changing the trigger conditions requires a full recompilation of the user design, since the modification affects the size and configuration of the hardware. The same is almost true for Xilinx, except that some of the changes can be made using their FPGA Editor software, in which case only a new configuration bitstream needs to be generated.

The advantages of these ad hoc methods are that they carry very little impact on the area and speed of the circuit and they allow the circuit to run at full speed during debug. They are very useful for speed testing of limited areas of the circuit. Unfortunately, ad hoc methods are limited in their ability to provide complete functional verification. First, they are usually design-specific—they require designer intervention to insert and use. What is desired is a structured technique that can be applied automatically and without user intervention to any possible user design. Second, they require the user to specify up front the desired signals to observe before the design is run through the vendor’s place and route tools, and the design must be rerun through these tools every time the user desires to watch a different signal.
This leads to many time-consuming passes through the place and route tools and many large, unwieldy configuration bitstreams to handle. Third, they provide limited visibility into the state of the circuit—only the signals that are routed to pads or to the particular chip’s built-in logic analyzer are visible. Lastly, they do not provide a mechanism for modifying the state of the circuit, an important debug feature similar to how the values of variables can be modified in a conventional software debugger.

2.1.2 Structured Methods

A prominent example of a structured technique is readback, a built-in mechanism which allows a user to retrieve an FPGA’s configuration bitstream. The configuration bitstream in and of itself is not useful for debugging, but when read back from an FPGA, it contains the current state of the FPGA’s flip-flops and memories. This state data can then be loaded into a simulator which will provide the user with the complete circuit state. Another technique is to use partial reconfiguration to load the internal circuit state from an external source to bring the circuit to a known state during functional verification. This allows the user to test corner cases or to return the circuit to a known state just before a point of failure without the tedious process of determining a series of inputs—if one even exists—that will bring the circuit to the desired state. A third technique is to use scan chains that are inserted into user circuitry in a manner similar to the way flip-flop scan chains are employed for VLSI testing [1]. This version of scan allows the information contained in the flip-flops and embedded memories to be scanned out serially through a ScanOut pin to obtain the circuit state. It differs from standard VLSI scan in that its purpose is to obtain the circuit state in order to validate the circuit logic, whereas the main purpose of VLSI scan is to find defects in the silicon after the logic has already been verified extensively in software. With FPGAs, however, silicon validation by the user is unnecessary since it has already been done extensively by the FPGA vendor. Also, due to the reconfigurable nature of FPGAs, FPGA-based scan is removable from the design when functional verification is complete, thus eliminating the overhead that scan incurs.

The benefits of using structured methods are many. First, structured methods have the potential to provide complete observability and controllability of the user circuit for functional verification. Thus, not only are all the signals in the circuit visible, the user also has the ability to set the circuit to a known state to aid in verification. Ad hoc methods, on the other hand, allow the designer to view only a portion of the circuit as it runs at full speed, and they do not provide the capability to modify the circuit state. Second, since structured methods provide visibility of the entire circuit state, only a single pass through the vendor’s place and route tools is necessary, eliminating the wasted time and extra configuration bitstreams associated with the many passes required for ad hoc methods. Third, methods like scan can be instrumented systematically and are not design specific, so the instrumentation processes can be automated.

Several downsides to using structured methods exist, however. For example, one disadvantage of readback is that it requires the clock to be stopped in order to perform the single-step (static) sampling.
This is because readback’s state sampling mechanism is different for flip-flops than it is for embedded memories. In order to maintain consistency between the two, the clock must be stopped during readback. In addition, scan carries significant area and speed overheads due to the added user logic necessary to perform scan. Also, the user circuit does not run at full speed for scan since many clock cycles may be needed to scan the circuit state out and back in to the design. In short, while ad hoc methods are useful for speed testing limited areas of the circuit, providing complete observability and controllability greatly simplifies functional verification, which significantly reduces the time to debug a circuit. The rest of this chapter will consider the current state of the three structured techniques mentioned previously—configuration bitstream readback, design-level scan, and partial reconfiguration—and show how scan overcomes the limitations of the other two to provide complete observability and controllability for functional verification of FPGA-based designs.

2.1.3 Configuration Bitstream Readback

Configuration bitstream readback enables a user to view the state of their design, and is provided by a number of FPGA vendors, such as Xilinx and Lucent. Splash-2 [2] and DecPerle [3] are examples of some early configurable computing systems that used readback to debug circuit designs. In Splash-2, for example, information generated by the Xilinx back-end tools provided a mapping between flip-flop values found in the readback bitstream and signal names from the XNF file to provide the user with the internal circuit state. However, the process of synthesizing a VHDL description to XNF, followed by logic trimming, logic and signal merging, and LUT mapping often made it difficult to find a signal given only its original name in the HDL source. As a more recent example of readback using Xilinx, [4] discusses the steps required to obtain all of the state information for their XC4K devices. One major problem involves getting the state of CLB RAMs, as it is necessary to determine how CLB RAM address pins are permuted by the router and then apply those permutations in reverse to obtain the RAM state. Once the readback data is obtained, [5] describes how it can be used for debugging in a combined simulation/hardware execution environment with JHDL [6, 7].

To illustrate how readback works, consider the process of using readback to obtain the circuit state on a Xilinx Virtex FPGA. When readback is performed, the state of all the flip-flops is sampled and written to specific locations in the configuration bitstream. The state of the LUT RAMs and BlockRAMs is not sampled, however, since their state is already maintained in the configuration bitstream. Next, the configuration bitstream exits the chip via one or more pins, such as a JTAG pin. Finally, the current state of all the flip-flops, LUT RAMs and BlockRAMs can be obtained from the configuration bitstream and fed into a simulation environment. Note that the user clock must be disabled during the readback process to maintain coherency between the flip-flop and RAM states. Otherwise, the circuit state may be changing while the configuration bitstream is exiting the FPGA, and although the bitstream will continuously update itself with the new RAM state, the FF state contained in the bitstream will contain the old, sampled values.
The obvious advantage of readback is that it comes for free from the designer’s viewpoint when provided as a built-in feature of the FPGA. It does have a number of drawbacks, however. First, readback bitstreams can be extremely large and unwieldy, with only a small percentage of the bitstream being useful for obtaining circuit state. For example, in a Xilinx Virtex V1000 FPGA, the bitstream contains over six million bits, only 9% of which represents the device’s flip-flop and memory state—the rest is configuration data. Second, specific information telling how the vendor’s software tools mapped the logic to LUTs is required to locate the state values of interest in the readback bitstream. This mapping is extremely difficult to obtain for Xilinx XC4000 and Virtex designs. Third, in some cases not all FPGA state is accessible via readback. For example, the state of the output registers of Virtex BlockRAMs is not available from the readback bitstream. Fourth, performing a readback may alter the state of the FPGA. The Virtex BlockRAM output registers are again an example of this, as their state is modified by readback. Fifth, readback requires the ability to stop the external clock, something that is often not supported by commercially-available FPGA systems. Finally, for many FPGA families, no readback support is provided at all, so a different mechanism must be devised to observe the FPGA’s internal state.

2.1.4 Design-Level Scan

Another method of obtaining the state of the circuit is to use design-level scan. Similar to Level-Sensitive Scan Design (LSSD) and other scan methods used in VLSI testing [1], it consists of adding multiplexors and gates to the memory elements of a design—such as flip-flops and embedded RAMs—so that the state elements’ values can be serially shifted out of the FPGA. The main downside of this method is that this added user circuitry may impose a high overhead to implement. The actual area and speed overheads will be addressed later in Chapter 4.

Nevertheless, compared with readback, design-level scan has several benefits. First, an FPGA does not require any special capabilities for design-level scan—it can be added to any user design on any FPGA. Readback, on the other hand, is available on only a handful of FPGA systems. Second, the amount of data scanned out of the circuit is much smaller and easier to manipulate than readback bitstreams, since scan bitstreams contain only the desired circuit state information. Third, determining the positions of signal values in the scan bitstream is straightforward since it is easy to determine the order in which the memory elements are arranged in the scan chain. Fourth, the state of the entire circuit can be retrieved by scan, whereas this is not always the case for readback. The output registers of the Virtex BlockRAMs are an example of this, as mentioned previously. Fifth, scan operates on a free-running clock, so it does not require the ability to stop the user clock. That is, the circuit runs at full speed until the user is ready to take a snapshot of the circuit state, at which point the clock is still running in order to scan out the circuit state, although no useful work is being done by the circuit during those clock cycles. Sixth, due to the reprogrammable nature of FPGAs, the scan chain can be completely removed from the design after verification, thus eliminating the overhead of the scan logic. Seventh, considerable variations exist in how scan can be implemented.
The simplest version is to place a multiplexor in front of every flip-flop to achieve a serial shift chain, as discussed later in Chapter 3. It could also be implemented as in Scan/Set Logic [1] by capturing partial snapshots of a running system’s state without interrupting its operation. Another partial scan variation is to capture only the input and output registers of selected blocks, such as multipliers, to reduce the amount of data to be scanned out of the device. When these blocks have already been previously verified, this level of visibility is often adequate. Lastly, scan allows the state of the circuit to be modified, a feature that readback does not provide. This important ability to bring the circuit into a known state is very useful in functional verification.

2.1.5 Partial Reconfiguration

The above discussion was mostly limited to enhancing a design’s observability. However, circuit controllability is also an important aspect of debug. For example, some FPGA systems support partial reconfiguration, which reads the state information from a selected block of the circuit, modifies the desired state bits, and writes the state back to the circuit block. Unfortunately, the state of user flip-flops cannot be modified without modifying the actual user design, which then needs to be reset to its power-up state. In addition, many FPGA systems do not support any controllability features at all. In contrast, design-level scan allows the setting of any state element included in the scan chain without disturbing the state of other circuit elements.

2.1.6 Summary

In summary, existing FPGAs provide at best only partial visibility into the state of an executing FPGA and little support for configuring a design into a known state. This chapter suggests using design-level scan to overcome the limitations of ad hoc methods, readback, and partial reconfiguration techniques to provide complete observability and controllability for functional verification of a user design on an FPGA.

The remainder of this work is organized as follows. First, a specific implementation of design-level scan is presented, together with a discussion of some of the CAD tool and hardware issues involved. Next, this implementation is used to quantify the cost associated with design-level scan for several designs taken from the configurable computing research at Brigham Young University. Following this, some variations of and alternatives to scan are presented to show how they can be used to supplement already existing observability and controllability features in some FPGAs. Finally, any loose ends about scan that were not discussed in the previous chapters are tied up, and a conclusion and suggested directions for future work are presented.

Chapter 3 Implementation of Scan Chain

3.1 Design-Level Scan Implementation

The main idea behind inserting user logic into a scan chain involves wiring up the memory elements, such as flip-flops and embedded RAMs, in such a way that the state bits contained in these elements exit the circuit serially through a ScanOut pin whenever the ScanEnable control signal is asserted. New state data for the FPGA concurrently enters the circuit serially on the ScanIn pin. When ScanEnable is deasserted, the circuit returns to normal operation. Figures 3.1 and 3.2 show a high-level view of how this works.
Figure 3.1: Circuit View when ScanEnable is Deasserted
Figure 3.2: Circuit View when ScanEnable is Asserted

3.1.1 Instrumenting Design Primitives

When implementing scan, only memory elements are inserted into the scan chain. Each FPGA vendor library has a number of primitive memory elements, such as flip-flops and embedded RAMs, from which larger memory cells can be derived. When inserting these larger memory cells into a scan chain, the easiest approach is often to treat the memory as a group of primitive memory elements, which are individually inserted into the scan chain. This section explains how the various primitive memory elements are instrumented for scan.

Instrumenting Flip-Flops

FPGA flip-flops (FFs) can be inserted into a scan chain by simply attaching a multiplexor before the data input of the FF and logic gates in front of the enables and set pins, as shown in Figure 3.3.

Figure 3.3: Instrumenting a Flip-Flop for Scan

The ScanIn signal in the figure is the ScanOut from the upstream memory in the scan chain, and the ScanOut signal becomes the ScanIn for the downstream memory in the scan chain. Thus, when ScanEnable is asserted, the memories in the circuit form a shift register; when ScanEnable is deasserted the circuit returns to normal operation. While ScanEnable is asserted, the FF must be enabled and allow its state bit to be shifted out. The two extra gates in front of the clock enable and set pins in this example serve this purpose.

The worst-case area overhead for a scannable FF is to add the multiplexor and two logic gates; fortunately, this price is rarely paid. For example, in many instances, clock enables, sets, and resets in a design are tied to a constant voltage, so the two gates in Figure 3.3 are not required. In other instances, several FFs share the same enables or set/reset logic, so the two gates in Figure 3.3 can sometimes be shared among multiple FFs. In addition, in some cases the LUT in front of a FF is empty or has unused inputs, and can thus be used for either the multiplexor or one of the gates. Figure 3.4 shows an example of how a bank of three FFs would be instrumented for scan if the clock enables are all shared and the sets tied to ground.

Figure 3.4: Instrumenting Multiple Flip-Flops for Scan
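To make the flip-flop instrumentation concrete, the short sketch below models the effect of the scan multiplexor and the clock-enable OR gate of Figures 3.3 and 3.4 in plain Java. This is an illustrative behavioral model, not JHDL code; the class and method names are assumptions made for the example. When ScanEnable is asserted the bank behaves as a shift register; otherwise each FF loads its normal data input.

```java
// Behavioral model of a bank of flip-flops instrumented with the scan multiplexor
// and the shared clock-enable OR gate (Figures 3.3 and 3.4). Illustrative only.
public class ScanFlipFlopBank {
    private final boolean[] state;            // current FF values

    public ScanFlipFlopBank(int width) { state = new boolean[width]; }

    // One clock edge. userData[i] is the normal D input of FF i; scanIn feeds FF 0.
    // Returns the ScanOut bit (the value held by the last FF before this edge).
    public boolean clock(boolean scanEnable, boolean clockEnable,
                         boolean[] userData, boolean scanIn) {
        boolean scanOut = state[state.length - 1];
        // The OR gate of Figure 3.3: the FFs are enabled whenever scanning.
        if (!(clockEnable || scanEnable)) return scanOut;
        if (scanEnable) {
            // Scan mode: the multiplexors select the upstream scan bit -> shift register.
            for (int i = state.length - 1; i > 0; i--) state[i] = state[i - 1];
            state[0] = scanIn;
        } else {
            // Normal mode: the multiplexors select the user data inputs.
            for (int i = 0; i < state.length; i++) state[i] = userData[i];
        }
        return scanOut;
    }
}
```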
Instrumenting Embedded RAMs

Inserting embedded RAMs into scan chains is significantly more complicated than inserting FFs. A RAM has multiple bits to scan out, so it is wired up in such a way that it operates like a FIFO when ScanEnable is asserted: it outputs its contents one bit per cycle while upstream ScanIn values are concurrently scanned in at one bit per cycle. For some embedded RAMs this is relatively simple to do; for others it can be very difficult. To illustrate, Figure 3.5 shows three 32X1 RAMs that are connected for scan.

Figure 3.5: Embedded RAMs Linked in a Scan Chain

The ScanIn for each RAM is simply the ScanOut of the upstream memory element. On the first cycle that ScanEnable is asserted, the Address Generator, which is basically an up-counter, produces a value of 0. The data bit stored at address 0 of each RAM in the circuit is read out the Dout port and passed along as the ScanIn to the downstream memory element. On the next cycle, the ScanIn value at each RAM is written into its address 0, the Address Generator produces a value of 1, the data stored at address 1 of each RAM is read out the Dout port and passed down the scan chain, and so on. After 32 cycles of this, the RAMs see an address of 0 again, and the process repeats, resulting in a FIFO.

Several details need to be addressed in order for this to work. For instance, any RAM to be inserted into the scan chain in this manner must be able to perform a read and a write in the same cycle. In the case of single-ported RAMs, it must also exhibit write-after-read behavior to avoid destroying unread data. For multi-ported RAMs, the write-after-read behavior is not required if the RAM supports different addresses for reading and writing. If the RAM does not meet these criteria, it must first be replaced by a comparable RAM that does at the time of insertion into the scan chain.

An example of a single-ported RAM is the synchronous LUT RAM in the Xilinx XC4000 and Virtex families. It is an asynchronously-read, synchronously-written (ARSW) memory and is straightforward to instrument for scan. During each cycle of scan, a ScanOut value is asynchronously read at the RAM’s output port while a new ScanIn value is written to that same location on the next clock edge, as explained previously. In addition, since the read is asynchronous, data is available for shifting out on the ScanOut port during the same cycle that ScanEnable is first asserted.

Figure 3.6: Multi-Bit ARSW RAM Instrumented for Scan

Figure 3.6 shows how this works for a single 32X3 RAM cell. The scan logic must be designed so as to allow normal circuit operation when not operating in scan mode. The multiplexors and OR gate in Figure 3.6 serve that purpose. Since the output of the RAM in the figure is three bits wide, the output bits are wrapped back around to the inputs during scan to form one continuous FIFO. Thus, RAM output O0 is wrapped back to input D1, O1 is wrapped back to D2, and O2 is the ScanOut value. This can be extended to any width memory desired. Also, the address generator counter is designed to start at a count of zero during the first cycle of scanning out so that the RAM bits are retrieved in a predictable order. Figure 3.7 shows that the address generator can be formed using a counter, a reset signal and a multiplexor.
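The FIFO behavior of a scanned multi-bit ARSW RAM can be sketched behaviorally as follows (again plain Java rather than JHDL; the names are illustrative). Each scan cycle asynchronously reads the addressed word, sends its highest bit down the chain, shifts the remaining bits over by one column, writes the upstream ScanIn bit into column 0 of the same address, and advances the address generator, so the full contents drain (and refill) one bit per cycle.

```java
// Behavioral model of the multi-bit ARSW RAM wrap-around of Figure 3.6.
// Illustrative sketch only: an m-deep, w-wide RAM behaves as a FIFO during scan.
public class ArswRamScan {
    private final boolean[][] ram;   // [address][bit column]
    private int addr = 0;            // address generator, starts at 0 for scan-out

    public ArswRamScan(int depth, int width) { ram = new boolean[depth][width]; }

    // One scan cycle: asynchronous read, then synchronous write at the same address.
    public boolean scanCycle(boolean scanIn) {
        boolean[] row = ram[addr];
        int w = row.length;
        boolean scanOut = row[w - 1];                        // O(w-1) is the ScanOut value
        for (int b = w - 1; b > 0; b--) row[b] = row[b - 1]; // O(i) wraps back to D(i+1)
        row[0] = scanIn;                                     // upstream bit enters at D0
        addr = (addr + 1) % ram.length;                      // address generator counts up
        return scanOut;
    }
    // Draining (and refilling) the full contents takes depth * width scan cycles.
}
```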
Figure 3.7: Address Generator for RAM Instrumentation

After the FF and RAM contents have been scanned out of the circuit, care must be taken as to what address appears on the address generator when the contents are being scanned back in. To illustrate, consider the example of a memory that is 16 bits deep in a circuit with a total scan chain length of 18. An example of such a circuit is a design containing one 16X1 RAM and two FFs. In order to ensure the RAM contents are replaced at their correct addresses when they are scanned back in, address 15 of the RAM needs to be the location written to during the last cycle of scan-in. Since the scan chain length is 18 in this example, this will only be accomplished if the address generator causes the RAM to be written at address 14 on the first cycle the scan bitstream is being scanned back into the circuit. A simple control unit is used to make sure the first bit of data to be written for scanning-in appears at the correct address; in this case, when the address generator is showing an address of 14. The control unit essentially consists of a counter whose number of count cycles is a function of the largest memory size used in the circuit and the total scan chain length. It controls which cycle data starts being scanned back into the circuit so that all the RAM state bits are placed at the correct memory locations.
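The following small simulation makes the example above concrete. It assumes a chain ordered ScanIn pin, one FF, the 16X1 ARSW RAM, a second FF, ScanOut pin; the ordering is an illustrative assumption, since the text does not fix one. It scans the 18-bit state out, scans the same bitstream back in with the address generator starting at address 14 (that is, (depth - chain length) mod depth), and checks that the original state is restored.

```java
// Illustrative simulation of the scan-in address offset for a 16x1 ARSW RAM plus
// two FFs (chain length 18). Not JHDL code; names and chain ordering are assumptions.
import java.util.Random;

public class ScanOffsetDemo {
    static final int DEPTH = 16, CHAIN = 18;   // 16 RAM bits + 2 FFs

    public static void main(String[] args) {
        Random rnd = new Random(1);
        boolean ff1 = rnd.nextBoolean(), ff2 = rnd.nextBoolean();
        boolean[] ram = new boolean[DEPTH];
        for (int i = 0; i < DEPTH; i++) ram[i] = rnd.nextBoolean();
        boolean[] origRam = ram.clone();
        boolean origFf1 = ff1, origFf2 = ff2;

        // Scan-out: address generator starts at 0; one state bit exits per cycle.
        boolean[] bitstream = new boolean[CHAIN];
        for (int t = 0; t < CHAIN; t++) {
            int addr = t % DEPTH;
            bitstream[t] = ff2;            // bit leaving on the ScanOut pin
            boolean dout = ram[addr];      // asynchronous read (old value)
            ram[addr] = ff1;               // write-after-read at the clock edge
            ff2 = dout;
            ff1 = false;                   // ScanIn pin is don't-care during scan-out
        }

        // Scan-in: the control unit arranges for the address generator to show
        // (DEPTH - CHAIN) mod DEPTH = 14 on the first scan-in cycle.
        int start = ((DEPTH - CHAIN) % DEPTH + DEPTH) % DEPTH;
        for (int t = 0; t < CHAIN; t++) {
            int addr = (start + t) % DEPTH;
            boolean dout = ram[addr];      // read before write
            ram[addr] = ff1;
            ff2 = dout;
            ff1 = bitstream[t];            // same bit order that was scanned out
        }

        boolean ok = (ff1 == origFf1) && (ff2 == origFf2);
        for (int i = 0; i < DEPTH; i++) ok &= (ram[i] == origRam[i]);
        System.out.println("start address = " + start + ", state restored = " + ok);
    }
}
```

Under these assumptions the last RAM bit is written at address 15 on the final scan-in cycle, and the run reports that the full state is restored.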
Consider the case where one or more bits of the RAM address are tied to a constant voltage. In such cases, some sections of the RAM are unused by the circuit and do not need to be included in the scan chain. Hence, the address generator outputs are only connected to those address bits that are not tied to a constant voltage, which reduces the number of scan cycles for the circuit. As an additional benefit, if no other RAM shares this address generator, the size of the address generator can also be reduced.

The overhead for instrumenting scan with ARSW RAMs can be determined by examining Figures 3.6 and 3.7. The overhead required to instrument an m-bit deep by n-bit wide RAM is approximately 2·log2(m) LUTs for the address generator, log2(m) LUTs to multiplex the address generator with the normal RAM address signal, 1 LUT to handle the write-enable logic, and n LUTs for the wrap-around data multiplexors, for a total overhead of approximately 3·log2(m) + n + 1 LUTs. However, if there are multiple RAMs in the circuit, the address generator logic can be shared amongst all the RAMs, meaning that the LUT overhead for each additional RAM in the circuit is only about log2(m) + n + 1 LUTs.

Finally, in the Xilinx Virtex technology, there exists a special kind of LUT-based RAM called the Shift Register LUT (SRL) that also requires consideration. This memory element is inserted into the scan chain as shown in Figure 3.8. When ScanEnable is asserted, the SRL is configured to its maximum size and the entire contents are shifted out a bit at a time. From the figure it can be seen that the LUT overhead required to instrument an m-bit deep SRL is about log2(m) + 2 LUTs. In a similar manner to the other LUT-based RAMs explained previously, if one or more of the address pins on the SRL are tied to a constant voltage, the OR-gate to that address pin is eliminated, thus reducing the overhead of wiring up the SRL for scan.

Figure 3.8: 16-deep SRL Instrumented for Scan

Instrumenting Fully Synchronous Embedded RAMs

Up until this point, the RAMs being considered all had asynchronous reads with synchronous writes. A challenging consideration now is RAMs with synchronous writes and reads. In the case of many such RAM blocks (Virtex Block SelectRAMs being a typical example), the behavior on a write is to forward the data being written to the RAM port’s output register. This violates the assertion made earlier in this chapter that writes must take place after reads to avoid destroying unread data. However, this behavior is not required for multi-ported RAMs that support different addresses for reading and writing. Thus, the first step for instrumenting such RAMs for scan is to replace the single-ported RAMs with their dual-ported counterparts. The read address for multi-ported RAMs will be one ahead of the write address during scan to create the necessary write-after-read behavior. If a fully synchronous single-ported RAM becomes available that does not have an appropriate substitute to provide the required write-after-read behavior, further research will need to be conducted to determine how such a RAM can be inserted into a scan chain.

Instrumenting a fully synchronous RAM with scan is a tricky process. To illustrate, Figure 3.9 shows an example circuit consisting of a 4096-bit synchronous RAM block and two user registers (U1 and U2). During normal operation these three elements are tied to some logic and are not necessarily related. A first attempt at scanning data out of these elements might be to wire them into a serial chain as shown in Figure 3.10.

Figure 3.9: Sample Circuit Containing Synchronous RAM
Figure 3.10: First Attempt At Scan-Out for Sample Circuit
Figure 3.11: Corrected Scan-Out Operation
Figure 3.12: Scan-In for Sample Circuit

However, a number of issues require modifications to this approach, including the following:

- Due to the synchronous nature of the reads, this setup would overwrite an unread memory location in the RAM during the first cycle of scan-out. The solution, as shown in Figure 3.11, is to provide an additional flip-flop (S1) before the RAM in the scan chain, as well as to inhibit writing to the RAM during the first cycle of scan. Thus, during the first cycle of scan the contents of U1 will be written to S1, and nothing will be written to the RAM to allow a read to take place. On succeeding cycles, the values stored in S1 will be written to the RAM.

- The first bit to exit a fully synchronous RAM when ScanEnable is first asserted is the current contents of the RAM’s output register, shown in Figure 3.11 as the second bit from the right (labeled R) in the scan-out bitstream. The actual contents of the RAM do not start to appear until the second cycle of scan. Since the output registers on Virtex BlockRAMs cannot be reloaded, the R bit is considered an extra bit in the scan chain. One option is to remove all the R bits from a scan bitstream before loading it back into an FPGA.
However, an easier alternative that allows the unmodified scan bitstream to be fed directly back into the circuit requires only a minor modification to the circuit, as shown in Figure 3.12. A new flip-flop (S2) has been inserted after the RAM during scan-in operation to store the extra R bit. It is important to note that the extra register before the RAM is gone during scan-in since data can begin writing into the RAM on the first cycle of scan-in.

Figure 3.13 shows an instrumented synchronous RAM which addresses all of these issues. In the figure, ScanningIn is a control signal indicating that a scan-in operation is taking place. Since the BlockRAM output registers cannot be reloaded, measures must be taken to ensure that the BlockRAM output reflects the correct value on the first cycle after a scan is performed. This can be accomplished by adding the logic shown in Figure 3.14, where a shadow register and a multiplexor on the output of the RAM are used to capture the output register’s contents on the first cycle of scan-out. Note that the value of Dout from the multiplexor does not matter on this cycle. On the first cycle after scan when ScanEnable goes low, the Dout from the multiplexor gets the value contained in the shadow register.

Figure 3.13: Synchronous RAM Instrumented for Scan
Figure 3.14: Shadow Output Register for Synchronous RAM

The above discussion has assumed the synchronous memory primitives had one-bit wide outputs. As was shown in Figure 3.6, the approach for handling a multi-bit ARSW RAM is to loop each bit back to another input in a daisy-chain fashion. In the case of synchronously-read RAMs, however, serial-to-parallel converters are placed in front of the RAM, and parallel-to-serial converters are placed after it. For an n-bit wide RAM, a read and write are performed once every n cycles. The converters then cause the RAM to receive and produce one bit per cycle in the scan chain. The main reason for using the converters instead of following the same approach as the ARSW RAMs is because some FPGAs have synchronously-read RAMs that allow different port widths. This technique covers both the case where all the ports are the same and the case where they are different, with the latter case implying that the reads and writes will occur at different rates.

3.1.2 Storing the Scan Bitstream

One more issue involves where to store the state bits between the time they exit the circuit on the ScanOut pin and the time they reenter the circuit on the ScanIn pin. There are many possible solutions to this issue, depending on the particular FPGA-based system being used. One easily implemented solution is to use the system’s external memory to store the scan bitstream. This can be done by storing the scan bitstream into sections of the memory that are unused by the user circuit. If not enough unused memory is available, sections of memory can be swapped out by the host controller long enough to allow the scan bitstream to be stored. After the bitstream is scanned back into the circuit again, the host controller replaces the memory sections it swapped out. A second method is to simply have the host control the bitstream storage by temporarily storing it in its own memory space.
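A minimal sketch of the second storage option is shown below: the host simply buffers the scan bitstream in its own memory. The ScanIo interface is hypothetical and stands in for whatever board-level API samples the ScanOut pin and drives the ScanIn pin; it is not part of any particular FPGA system.

```java
// Host-side storage of the scan bitstream (illustrative sketch, hypothetical board API).
import java.util.BitSet;

public class HostScanBuffer {
    public interface ScanIo {              // hypothetical board-access interface
        boolean readScanOut();             // sample ScanOut, then advance one scan clock
        void driveScanIn(boolean bit);     // present a bit on ScanIn for one scan clock
    }

    private final BitSet bits = new BitSet();
    private int length;

    // Capture the entire circuit state: one bit per clock for chainLength cycles.
    public void capture(ScanIo io, int chainLength) {
        length = chainLength;
        for (int i = 0; i < chainLength; i++) bits.set(i, io.readScanOut());
    }

    // The host can inspect or modify individual state bits before restoring,
    // which is how controllability is exercised from the host side.
    public void setBit(int position, boolean value) { bits.set(position, value); }

    // Replay the (possibly modified) state back into the circuit in the same order.
    public void restore(ScanIo io) {
        for (int i = 0; i < length; i++) io.driveScanIn(bits.get(i));
    }
}
```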
3.1.3 Instrumenting The Design Hierarchy

A number of methods exist for actually applying the scan instrumentation described in this chapter. One such method involves making modifications to a design that has already been placed and routed. This technique is difficult to implement since adding the scan logic will greatly interfere with the existing routing, and the circuit will have to be placed and routed all over again. Another option is making modifications to an EDIF netlist. Parsing and modifying an EDIF netlist is also difficult, but it can be a viable solution. A third option is to make the modifications to a circuit database prior to netlisting in the original CAD tool. This option is the approach of choice within the JHDL design environment since it is relatively simple to implement and can easily be automated. In this approach, the user design is first included inside a design “wrapper” that adds the four wires for controlling the scan chain—ScanEnable, ScanIn, ScanOut, and ScanningIn—and connects these and the user’s wires to I/O pins on the FPGA. Next, the instrumentation tool traverses the circuit hierarchy in a depth-first fashion, visiting all design submodules and inserting all primitive memory elements into the scan chain. This is done by adding the four scan signals as ports to each hierarchical cell, and adding scan logic to each flip-flop and embedded RAM, as described in the previous section. Finally, an address generator is added as needed for controlling the memories. Once the design is instrumented, an EDIF netlist is then generated and run through the FPGA vendor’s back-end tools.

3.1.4 Optimizing Scan

As is shown later in Chapter 4, the insertion of scan chains into FPGA-based circuits can be costly. A number of strategies will help reduce this overhead. For instance, in many designs, a number of gates and flip-flops will be optimized away by the FPGA technology mapping tools because they have sourceless inputs or loadless outputs. An example of this is when a pipelined array multiplier is created by a module generator where not all multiplier outputs are utilized by the user circuit. Instrumenting the unused circuitry for scan creates more overhead than is necessary for two reasons: not only is circuitry being added to unused flip-flops, the added circuitry itself prevents the flip-flops from being optimized away by the back-end tools. Thus, only state elements that are not normally optimized away should be inserted into the scan chain. In the current implementation of scan for Xilinx FPGAs, this is done by running the back-end tools on an uninstrumented version of the design, parsing an XDL representation of the mapped design to determine which flip-flops and embedded memories were optimized away, and storing this information in a file. When the same design is instrumented with scan, the file contents are placed in a hash table, and the table is consulted for each flip-flop and embedded RAM to determine whether it should be inserted into the scan chain.
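The depth-first traversal and the filtering of optimized-away elements can be sketched as follows. The Cell and Primitive classes are hypothetical stand-ins for a netlist data structure; this is not the JHDL API, and only the structure of the pass is intended to be illustrative.

```java
// Simplified sketch of the instrumentation pass (hypothetical netlist classes, not JHDL).
import java.util.ArrayList;
import java.util.List;

public class ScanInstrumenter {
    static class Cell {                        // hypothetical hierarchical cell
        List<Cell> children = new ArrayList<>();
        List<Primitive> primitives = new ArrayList<>();
        void addScanPorts() { /* add ScanEnable, ScanIn, ScanOut, ScanningIn ports */ }
    }

    static class Primitive {                   // hypothetical FF or embedded RAM leaf
        boolean isMemoryElement;
        boolean optimizedAway;                 // looked up from the back-end tools' report
        void insertIntoScanChain() { /* add the mux/gates or RAM scan logic of Chapter 3 */ }
    }

    // Depth-first walk: thread the four scan signals through every level of the
    // hierarchy, then instrument each memory primitive that will survive mapping.
    public void instrument(Cell cell) {
        cell.addScanPorts();
        for (Cell child : cell.children) instrument(child);
        for (Primitive p : cell.primitives) {
            if (p.isMemoryElement && !p.optimizedAway) p.insertIntoScanChain();
        }
    }
}
```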
Another kind of optimization that can be performed is resource sharing of clock enable and set/reset logic. Many modules, especially large, regular ones such as multipliers, CORDIC operators, counters, etc., have all their flip-flops share common clock enable and set/reset logic. In addition, RAM modules formed from primitive LUT RAMs share common addressing logic amongst the LUT RAMs. Two options exist for optimization in these situations. The first is to determine, within a particular layer of hierarchy, the sources of all the clock enables, set/reset logic, and addressing logic. The common sources then share the same scan instrumentation logic. The second method is to simply maintain a list of frequently used modules that share common enables and sets/resets for FFs, and address signals for RAMs. Whenever one of these is encountered during the depth-first circuit traversal, shared signals are instrumented only once for scan for the whole module. The first method has greater potential to reduce the overhead of scan than the second method; for example, in the second method, if a module is used that could share a lot of logic, but is not on the list of commonly used modules, the optimization will not be performed. However, the first method is much more difficult to implement correctly than the second method. For the tests reported in the next section, the second approach was taken.

Lastly, as mentioned previously, many mechanisms exist for doing partial scan. In the tests reported next, partial scan results are estimated for a number of modules and designs.

Chapter 4 Costs of Scan Chain

4.1 The Costs of Design-Level Scan

This chapter discusses the costs of instrumenting user circuits with scan chains. Some examples include the extra I/O pins used for the ScanIn, ScanOut, ScanEnable, and ScanningIn control signals mentioned throughout Chapter 3, as well as the off-chip memory required to store the scan bitstream when operating in scan mode, discussed in Section 3.1.2. The main concern to a designer, however, is the circuit area and speed overhead of scan. Full scan in VLSI has reported area overheads of 5% to 30% [8, 9, 10]. This chapter will show that the area and speed overheads of full scan in FPGAs are much greater than this.

4.1.1 Scan for Library Modules

To begin with, consider three modules taken from the JHDL Virtex libraries. These modules consist of a counter, an array multiplier, and a CORDIC unit. The counter is a simple 4-bit up-counter, as shown in Figure 4.1. This circuit consists of a registered 4-bit adder that adds a constant 1 to the count output each cycle.

Figure 4.1: A 4-Bit Up-Counter

Table 4.1 shows that the counter normally requires four 4-input LUTs (4-LUTs) and four flip-flops (FFs), as is expected. Instrumenting the counter for scan involves adding an extra multiplexor for the data input and an extra OR gate to each FF in the design, as explained in Section 3.1.1 (no sets or resets were used in the counter). Thus, Table 4.1 shows that a total of twelve LUTs is required to instrument the counter for scan, which makes the circuit three times the size of the original. However, a more optimal approach may be taken since all the FFs share a common clock enable. The optimized approach, then, is to OR the clock enable with the ScanEnable and have this single OR gate drive the clock enables for all four FFs, as shown in Figure 4.2. The result is to reduce the total LUT count to nine, which makes the optimized circuit 2.25 times as large as the original, as shown in Table 4.2.
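The counter numbers can be checked with a short calculation (an illustrative back-of-the-envelope computation, not tool output), assuming one multiplexor LUT per FF plus either one OR-gate LUT per FF (unoptimized) or a single shared OR gate (optimized):

```java
// Worked check of the 4-bit counter scan costs discussed above (illustrative only).
public class CounterScanCost {
    public static void main(String[] args) {
        int normalLuts = 4, flipFlops = 4;

        int unoptimized = normalLuts + flipFlops * 2;      // mux + OR per FF      -> 12 LUTs
        int optimized   = normalLuts + flipFlops + 1;      // mux per FF, one shared OR -> 9 LUTs

        System.out.printf("unoptimized: %d LUTs (%.2fx)%n", unoptimized, (double) unoptimized / normalLuts);
        System.out.printf("optimized:   %d LUTs (%.2fx)%n", optimized, (double) optimized / normalLuts);
        // Matches Tables 4.1 and 4.2: 12 LUTs (3.00x) and 9 LUTs (2.25x).
    }
}
```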
The skew registers shown in the figure allow each bit of X to enter the multiplier cells during the correct cycle. Figure 4.4 shows a single multiplier cell, which requires two FFs to pipeline the Y input and the partial product of the multiplier. Since only the upper 16 bits of the result are used, no deskewing FFs are required for the multiplier's output. This design has many more FFs than 4-LUTs due to the skew registers for X and the two pipeline registers. As shown in Table 4.1, the normal multiplier requires approximately 270 LUTs, and the number of FFs used in the multiplier is the sum of the FFs for the skew register and the two pipeline registers, for a total of 615 FFs.

Figure 4.2: A 4-Bit Up-Counter Instrumented for Scan

Table 4.1: Design-Level Scan Costs for a Few Modules without Optimizations

Module     FFs    Normal LUTs   Normal Speed (MHz)   Scan LUTs   LUT Ratio   Scan Speed (MHz)   Speed Ratio
cnt        4      4             165.67               12          3.00        137.01             0.83
mult       615    270           103.63               1470        5.44        85.87              0.83
cordic     768    780           75.79                2301        2.95        63.42              0.76
averages                                                         3.80                           0.81

Table 4.2: Design-Level Scan Costs for a Few Modules with Optimizations

Module     FFs    Normal LUTs   Normal Speed (MHz)   Scan LUTs   LUT Ratio   Scan Speed (MHz)   Speed Ratio
cnt        4      4             165.67               9           2.25        131.56             0.79
mult       615    270           103.63               871         3.23        85.32              0.82
cordic     768    780           75.79                1596        2.05        60.75              0.80
averages                                                         2.51                           0.80

Figure 4.3: Conceptual View of a 16x16 Array Multiplier

Figure 4.4: A Single Multiplier Cell

The unoptimized scan insertion approach adds a multiplexor and an OR gate to every FF. This requires 1200 additional 4-LUTs, for a total of 1470 (5.44 times the number in the original circuit). The optimized version shares one OR gate for the clock enable among all FFs, as illustrated in Figure 4.5. (Although the skew registers for X are not shown in the figure, they are instrumented for scan in a manner similar to the example in Figure 3.4.) In this case, the design has only 3.23 times as many LUTs as the original.

Figure 4.5: A Single Multiplier Cell Instrumented for Scan
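The LUT counts in Tables 4.1 and 4.2 can be anticipated with a simple back-of-envelope model: unoptimized scan adds roughly one multiplexor LUT and one OR-gate LUT per flip-flop, while the optimized version adds one multiplexor LUT per flip-flop plus a single shared OR gate. The sketch below is a rough model, not the instrumenter's exact accounting; its predictions land close to, but not exactly on, the measured values because some gates fold into existing LUTs, some extra route-through LUTs are needed, and not every flip-flop shares one common clock enable.

    // ScanLutEstimate.java -- rough model of the scan LUT overhead reported in
    // Tables 4.1 and 4.2 (illustrative only; counts taken from the tables).
    public class ScanLutEstimate {

        // Unoptimized scan: one multiplexor LUT plus one OR-gate LUT per flip-flop.
        static int withoutOptimization(int originalLuts, int flipFlops) {
            return originalLuts + 2 * flipFlops;
        }

        // Optimized scan: one multiplexor LUT per flip-flop plus a single OR gate
        // shared by all flip-flops with a common clock enable.
        static int withOptimization(int originalLuts, int flipFlops) {
            return originalLuts + flipFlops + 1;
        }

        public static void main(String[] args) {
            String[] names = { "cnt", "mult", "cordic" };
            int[] luts     = { 4, 270, 780 };   // normal LUT counts (Table 4.1)
            int[] ffs      = { 4, 615, 768 };   // flip-flop counts (Table 4.1)
            for (int i = 0; i < names.length; i++) {
                System.out.printf("%-6s  ~%4d LUTs w/o opt   ~%4d LUTs w/ opt%n",
                        names[i],
                        withoutOptimization(luts[i], ffs[i]),
                        withOptimization(luts[i], ffs[i]));
            }
        }
    }

For the counter the model is exact (12 and 9 LUTs); for the multiplier and CORDIC it comes within a few percent of the measured 1470/871 and 2301/1596 LUTs.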
The last module in Tables 4.1 and 4.2 is a 16-bit, fully-pipelined rotational CORDIC unit containing 15 iterations, with each iteration providing one more bit of accuracy to the output. Since the CORDIC is unrolled to make it fully pipelined, each iteration corresponds to a stage, as shown in Figure 4.6. Each stage of this design consists of three registered adders/subtractors, as shown in Figure 4.7. Note that the shifters in the figure simply reorder the input signals and do not contain any logic. This design is quite different from the multiplier, since the number of LUTs and FFs is roughly the same.

Figure 4.6: A Fully-Pipelined Rotational CORDIC Unit

Figure 4.7: One CORDIC Stage

Figure 4.8 shows the optimized version of scan for the CORDIC unit, with a single OR gate controlling the clock enables on all the FFs. The three "FF Scan Chain" blocks shown in the figure simply consist of a bank of sixteen FFs hooked up to a scan chain, in the same manner as the example shown in Figure 3.4 (but without the OR gate). As can be seen in Table 4.2, the result is a design with a little more than twice the number of LUTs as the original.

Figure 4.8: A Fully-Pipelined Rotational CORDIC Unit Instrumented for Scan

Two important notes about the increase in LUTs for scan logic are in order. First, some extra LUTs beyond those used for the multiplexor and logic gates may be required for place-and-route purposes. For example, if many FFs to be scanned are packed close together in the original circuit, the vendor's place and route tools may require some extra LUTs to help route the scan multiplexors to the appropriate FFs. Second, sometimes the increase in LUTs for scan is lower than expected because some of the scan instrumentation logic is folded into existing 4-LUTs from the user circuit. This happens when 4-LUTs from the original design have unused inputs that can be used by the instrumented scan logic. To illustrate, Figure 4.9 shows that if the logic generating the clock enable signal uses only two inputs of a 4-LUT, the ScanEnable signal can be routed to a third input of that LUT so that the OR gate does not require an additional LUT. Section 4.1.4 will show what kind of impact logic packing has on scan for several designs.

Figure 4.9: Folding Scan Logic Into Existing 4-LUTs
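As a concrete illustration of this folding, the sketch below recomputes the INIT mask of a 4-LUT so that the ScanEnable OR gate of Figure 4.9 rides along on a spare input. The particular two-input clock-enable function, the assignment of signals to LUT inputs, and the bit ordering of the INIT value are all assumptions made for illustration; they are not taken from the actual instrumentation tool or from Xilinx documentation.

    // LutFold.java -- illustrative only: shows how an OR with ScanEnable can be
    // folded into a spare input of an existing 4-LUT by recomputing its INIT mask.
    public class LutFold {
        // Assumed original 2-input clock-enable function (in the spirit of Figure 4.9):
        // enable the flip-flop when the global clock enable is high and reset is low.
        static boolean originalCe(boolean globalCe, boolean reset) {
            return globalCe && !reset;
        }

        // Recompute a 16-entry truth table where input I2 carries ScanEnable.
        // I3 is left as a don't-care. Bit ordering of INIT is assumed for illustration.
        static int foldedInit() {
            int init = 0;
            for (int addr = 0; addr < 16; addr++) {
                boolean globalCe   = (addr & 1) != 0;   // I0
                boolean reset      = (addr & 2) != 0;   // I1
                boolean scanEnable = (addr & 4) != 0;   // I2
                boolean out = originalCe(globalCe, reset) || scanEnable;
                if (out) init |= (1 << addr);
            }
            return init;
        }

        public static void main(String[] args) {
            System.out.printf("folded LUT INIT = 0x%04X%n", foldedInit());
        }
    }

Because the original clock-enable function used only two of the four inputs, the new function still fits in the same LUT, so the OR gate costs no additional area in this case.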
Notice from Tables 4.1 and 4.2 the speed penalties suffered by the three library modules when instrumented for scan. On average, these circuits operate at only about 80% of the original frequency once the scan logic is inserted. The effect of this penalty depends on the application: some circuits have a minimum operating speed that must be met, whereas for other circuits the operating speed is not as important. Notice as well that optimizing scan for area can further reduce the speed at which the circuit may operate, so trading increased area for increased speed is sometimes necessary, depending on the application. Since scan is being used for functional verification of the circuit logic and will be removed once the design is verified, this work does not place much emphasis on the speed penalty incurred by scan.

The area overheads in Tables 4.1 and 4.2 were determined by counting the number of additional 4-LUTs used for scan instrumentation. However, this number often does not accurately reflect the increase in area of the circuit. To illustrate, a Virtex slice contains two FFs with two corresponding 4-LUTs. If the FF in a particular slice is used by the design but its corresponding 4-LUT is left unused, that unused 4-LUT can hold the scan multiplexor without increasing the number of slices in the design. In other words, an increase in the number of 4-LUTs does not necessarily mean an increase in circuit area. Thus, a better area metric is the logic element (LE), which consists of a single 4-input LUT, carry logic, and a FF. LEs are the basic building blocks of most modern SRAM-based FPGAs, and using the LE as a metric more accurately shows the area overhead of the design, since it accounts for the scan logic that fits into partially filled LEs without increasing the area of the circuit.

Table 4.3: Design-Level Scan Costs for a Few Modules—LUT vs. LE Costs

                 Normal              Scan w/o opt                              Scan w/ opt
Module     LUT      LE        LUT      LUT      LE       LE         LUT      LUT      LE       LE
           Count    Count     Count    Ratio    Count    Ratio      Count    Ratio    Count    Ratio
cnt        4        4         12       3.00     12       3.00       9        2.25     9        2.25
mult       270      630       1470     5.44     1485     2.36       871      3.23     871      1.38
cordic     780      812       2301     2.95     2316     2.85       1596     2.05     1596     1.97
averages                               3.80              2.74                2.51              1.87

Table 4.3 provides an LE-based comparison of circuit area overheads for the counter, multiplier, and CORDIC. As mentioned previously, the counter and the CORDIC use roughly the same number of FFs as 4-LUTs; as such, most of their LEs are already full before the design is instrumented for scan. In their case, increasing the number of 4-LUTs increases the number of LEs by roughly the same amount, since there are no partially filled LEs in which to place the scan logic. Hence, the table shows that the overhead for these two modules is about the same whether it is measured in 4-LUTs or LEs. In contrast, the multiplier has many more FFs than LUTs due to its many pipeline registers, so much of the scan logic can be packed into the partially filled LEs left by those pipeline FFs. Thus the table shows that the multiplier instrumented with unoptimized scan is really only 2.36 times as large when the area is measured in terms of LEs, as opposed to 5.44 times as large in terms of LUTs. Coincidentally, for these three modules the LE count and LUT count are the same for optimized scan since the LE count is LUT-dominated; this will not necessarily be the case for all designs. This table shows that measuring the overhead in terms of LEs instead of 4-LUTs gives a more accurate view of the scan area overhead, which is not quite as large as the LUT counts suggest.

In short, designs that have a high FF-to-LUT ratio allow packing of scan logic into partially filled LEs, thus reducing the area overhead incurred by scan. This is particularly evident in the multiplier, where the pipeline registers leave many LEs only half-full. However, a high FF count can still significantly increase the area of the circuit when instrumenting scan, particularly when most of the LEs are full. In the case of these modules, full scan on average nearly doubles the number of LEs used by the circuit. For designs that are dominated mostly by combinational logic, however, fewer memory cells need to be scanned, so the area growth is much smaller.

4.1.2 Partial Scan

In recent years, partial scan techniques have been extensively researched and developed as a method to reduce the area overheads associated with scan chains. A common approach is to scan only portions of the circuit deemed important for debug. For example, the approach taken here is to scan only the input and output registers of certain library modules.
When the module has already been well tested and verified, only the state of the input and output registers of the module is necessary for verifying the rest of the circuit; the internal state of the module is not necessary since it is already well-known. Consider, for example, the array multiplier shown previously in Figure 4.3. If this module has already been extensively verified, the partial scan approach is to scan the final output registers and to use separate “shadow registers” to capture the inputs. The pipeline registers are not included in the scan chain since the behavior of the multiplier is already well known. However, the FFs not included in the scan chain must be disabled during scan so that their state isn’t modified. One approach is to add an extra AND gate to such FFs, as shown in Figure 4.10, so that the state of these FFs does not change during scan. However, most of the FFs in the multiplier will require this AND gate since they will not be included in the scan chain. A better approach in this case, then, is to reduce the overhead by using this AND gate to disable the entire multiplier during scan. A single OR gate can then be used for the FFs in the final output register to enable them during scan. A conceptual view 36 D D ScanEnable Clk En Q Q Clk_En Clk Figure 4.10: Disabling Unscanned FFs for Partial Scan of this is shown in Figure 4.11. The three “FF Scan Chain” blocks each consist of a bank of sixteen FFs hooked up to a scan chain, in essentially the same manner as the example shown in Figure 3.4, but without the OR gate for the clock enable. The two “FF Scan Chain” blocks shown by dotted lines in Figure 4.11 represent the two shadow registers, and the “FF Scan Chain” block on the right represents the final output registers of the multiplier. 16 ScanIn ScanEnable D ScanIn ScanEnable Clk En ScanOut Q D ScanIn ScanEnable Clk En ScanOut FF Scan Chain 16 ScanEnable Q FF Scan Chain X Y Clk En ScanEnable 16 16 MULT X 16 PPout Y ScanEnable Clk En Clk En ScanEnable D ScanIn ScanEnable Clk En Q 16 ScanOut FF Scan Chain Figure 4.11: Conceptual View of Partial Scan for the Array Multiplier 37 Product ScanOut Table 4.4: Design-Level Scan Costs for a Few Modules—Full vs. Partial Scan Module Normal Scan w/o opt Scan w/ opt Partial Scan (est.) LE LE LE LE LE LE LE Count Count Ratio Count Ratio Count Ratio cnt 4 12 3 9 2.25 N/A N/A mult 630 1485 2.36 871 1.38 680 1.08 cordic 812 2316 2.85 1596 1.97 894 1.14 averages 2.74 1.87 1.11 Table 4.4 compares full scan costs with estimated partial scan costs. For the multiC plier, the cost is about J LEs— E LE to disable the multiplier during scan, GI LEs to scan the two 16-bit input shadow registers, EK LEs for the scan multiplexors for the 16-bit output register, and E LE for the OR gate for the output register—resulting in approximately BD overhead. The same approach is taken for the CORDIC unit, in that shadow registers are used to capture the inputs, the final output registers are connected to the scan chain, a single AND gate is used to disable the CORDIC during scan, and a single OR gate is used B B to enable the output FFs during scan. The overhead is approximately I LEs— L LEs to scan the three 16-bit input shadow registers, I LEs for the single AND and OR gates, and GI LEs for the scan multiplexors for two of the output registers—the Zout for rotational CORDICs and the Yout for vectoring CORDICs need not be scanned since it has a value of D D 0. 
This amounts to about EL overhead, which is significantly less than the MN overhead reported for optimized full scan. As shown by these two examples, when partial scan is used for some large modules such as pipelined multipliers, CORDIC units, and other large datapath elements, the overhead is greatly reduced. However, for some library modules, such as counters, all of the FFs need to be scanned to get the state of the counter, so partial scan does not provide any advantages in overhead. 38 Design Eigenray BF Low-power BF CDI Superquant Table 4.5: Area and Speed of Sample Designs Original FF BlockRAM LUT RAM Total LUT Count Count Count Count 2216 0 67 1775 738 30 1935 14559 4478 18 40 5738 4890 0 3658 11806 LE Speed Count (MHz) 2658 14.35 14719 3.70 6675 31.28 14087 29.46 Table 4.6: Design-Level Scan Costs for Sample Designs With Optimized Full Scan Design FF FF LUT LUT LE LE Speed Count Ratio Count Ratio Count Ratio (MHz) Eigenray BF 2222 1.00 3413 1.92 3445 1.30 9.62 Low-power BF 1307 1.77 24245 1.67 24391 1.66 N/A CDI 5455 1.22 12812 2.23 13434 2.01 25.26 Superquant 4896 1.00 32192 2.73 32192 2.29 N/A averages 1.25 2.14 1.82 4.1.3 Speed Ratio 0.67 N/A 0.81 N/A 0.74 Scan for Large Designs Up to this point, the examples in this chapter have dealt only with small JHDL library modules. To determine the cost of scan for complete designs, several large JHDL designs available at BYU were instrumented for scan. The area and speed costs of the original designs are shown in Table 4.5 while the cost of instrumenting these designs with optimized full scan are shown in Table 4.6. The BlockRAM count shown in the first table is for the fully synchronous RAMs used in the designs, and the LUT RAM count is for the RAMs with asynchronous writes and synchronous reads, as described in Chapter 3. The first design in the tables, Eigenray BF, is a sonar beamformer that does matched field processing. It is heavily pipelined and has a significant data path consisting of 2 CORDIC units, 5 multipliers, and other logic. Since it is the only XC4000 design in the table (the rest are Virtex), to keep the comparisons of the different designs consistent, only the 4-LUTs are reported in the table (the H-LUTs for the XC4000 design are ignored). In 39 addition, the 4-LUTs are what dominate the area of the design and are comparable to LEs in understanding the area overhead of scan. The second design, Low-power BF, is described in [11]. It includes a 1024-point FFT unit and an acoustic beamformer, and is similar to the Eigenray BF in that it has many datapath modules (CORDICs, multipliers, etc.). However, due to power constraints, this design is not pipelined at all. The final two designs are related to automatic target recognition and differ from the first two since they are more control intensive rather than data path intensive. CDI is a form of the design reported in [12] whose function is to perform histogramming and peak finding. The Superquant design performs adaptive image quantization to optimally segment images for target recognition. All four of these designs factor in the cost of instrumenting RAMs in addition to FFs; as shown in Table 4.5, Low-power BF and CDI use a significant number of BlockRAMs (only 32 are on a chip) and Low-power BF and Superquant use a great deal of LUT RAMs of various types (32x1, 16x1, dual-ported 16x1, etc.). For Eigenray BF, note in Table 4.6 that, although the LUT count nearly doubles CD . 
This is because, in the when it is instrumented for scan, the LE overhead is only G original design, the LE count is dominated by flip-flops. Instrumenting the design nearly doubles the number of LUTs required, but many of those LUTs were able to be absorbed into LEs which previously contained only flip-flops. In addition to the cost of instrumenting FF scan logic, some of the other overhead comes from instrumenting the LUT RAMs in the design. Recall from Section 3.1.1 that each LUT RAM requires additional LUTs for multiplexing the data and the address inputs, an OR gate for the write-enable, and some logic for the address generator. For a standard 16X1 LUT RAM, this results in a cost of M LUTs for the address generator (which is paid only once since it is shared by all the LUT RAMs) and K LUTs for the multiplexors and OR gate for each 16X1 LUT RAM. Since this design uses relatively few LUT RAMs, this overhead is fairly small. The rest of the overhead can be explained by Section 4.1.1 in that some extra LUTs are required for routethrough. In addition, some of the scan logic is packed into already existing 4-LUTs, which actually reduces the overhead of scan. Finally, as can be seen by the FF counts for this design in Table 4.6, there is a small overhead of K FFs required to instrument the address generator for the LUT RAMs. Compared to the number of FFs used by the original circuit, 40 this number is negligible. The analysis of some of the other designs in the table is even tricker since they use synchronous BlockRAMs. Before the analysis can be done, consider the LUT and FF overhead of instrumenting BlockRAMs for scan as shown in Figures 4.12 and 4.13, respectively. The bottom curve on the graphs show the cost of instrumenting a single-ported 150 O 4-LUT Overhead 140 130 120 110 100 90 80 1 2 4 Data Width 8 16 Single-Ported BlockRAM Dual-Ported BlockRAM (Lower Bound) Dual-Ported BlockRAM (Upper Bound) Figure 4.12: LUT Scan Overhead for Instrumenting a Single Virtex BlockRAM 80 Flip-Flop Overhead 70 60 50 40 30 20 10 1 2 4 8 16 Data Width Single-Ported BlockRAM Dual-Ported BlockRAM (Lower Bound) Dual-Ported BlockRAM (Upper Bound) Figure 4.13: Flip-Flop Scan Overhead for Instrumenting a Single Virtex BlockRAM 41 BlockRAM (which is instrumented by first converting it to a dual-ported BlockRAM). As the data width of the BlockRAM increases, so does the number of 4-LUTs and FFs needed to instrument it in the form of extra feedback multiplexors and larger converters for the inputs and outputs, as explained in Section 3.1.1. The middle curve is the lower bound curve for the dual-ported BlockRAMs, and represents the case where both data ports are of the specified width. The top curve, or upper bound curve for the dual-ported BlockRAMs, represents the case when data port A of the BlockRAM is of the specified width, while data port B is at the maximum width of 16. The overhead required to instrument a dualported BlockRAM thus falls somewhere between these two bounds. Note that these graphs show the cost of instrumenting a single BlockRAM; some of this overhead is in the form of control logic and can be shared if multiple BlockRAMs are instrumented. (Approximately IPL LUTs and EG FFs is shared by multiple BlockRAMs.) Nevertheless, the cost of instrumenting BlockRAMs is very significant. With this in mind, consider the next design in Tables 4.5 and 4.6—Low-power BF. 
As can be seen in Table 4.5, the LUT count and LE count for the design are almost the same, so the LE count is dominated by LUTs as opposed to FFs. Thus, not much of the scan logic will be able to be packed into partially filled LEs like it was for Eigenray BF, D D LUT overhead and KK LE overhead. A relatively which is reflected by the similar KN small portion of this overhead comes from the FFs, since the number of FFs is small compared to the number of LUTs in the original design. The majority of the overhead comes C from the LUT RAMs and BlockRAMs found in the design. The design uses G dual-ported BlockRAMs with 4-bit wide data ports (ports A and B). Figure 4.12 shows the scan overC head per BlockRAM to be about E J LUTs, although approximately IJ of those LUTs can BC LUTs per BlockRAM. be shared by all the BlockRAMs, leaving an overhead of about C CVUWBCXZY I'L[IJ LUTs, or So the LUT overhead for the G BlockRAMs is about IJRQTS9G D approximately E&N . Most of the rest of the overhead for this design, however, is caused by the relatively high number of LUT RAMs used in the design. At first glance it would U CC appear that the overhead for the LUT RAMs is about K LUTs since a EMGJ]^ \ EE_`K 16x1 LUT RAM typically requires K LUTs to instrument for scan. However, most of these LUT RAMs were combined in the design to form larger RAMs; hence, much of the scan 42 logic can be shared. The FF overhead for this design is pretty high due to the use of G C BlockRAMs in the circuit. As can be seen in Figure 4.13, the FF overhead is approximately G>E FFs per B BlockRAM. EG of these FFs can be shared by all the BlockRAMs, so about E FFs are C+U BX actually used by each BlockRAM. So the FF overhead is about EG2QaS!G E Q (# of FFs D FF overhead. Fortunately, in this design, for the address generator), which results in a NN most of these extra FFs are packed into already existing LEs. CDI can be analyzed similarly. With CDI, the design has a large number of FFs relative to LUT RAMs and BlockRAMs, so most of the scan overhead comes from instrumenting the FFs. Although the exact overhead numbers for the FFs are unknown, just the CC LUTs. The second scan multiplexors for the FFs alone require an increase of about L>_bJ largest source of overhead comes from the use of 18 dual-ported BlockRAMs with 16-bit data ports in the design. According to Figures 4.12 and 4.13, this translates into a cost C BC of about E&J LUTs and FFs per BlockRAM (minus the overhead that is shared by the BlockRAMs). The LUT RAMs, however, do not have much of an impact with scan in this C circuit, as there are only L LUT RAMs in the entire design. CDI has a high area overhead, D C D with a E&IG increase in 4-LUTs, or a E E increase in LEs. Although Superquant does not use any BlockRAMs, it does use a large number of FFs and LUT RAMs in the design, giving it the highest overhead of all four of the designs in the table. Although the FFs contribute greatly to the scan overhead, the greatest amount of overhead is contributed by the LUT RAMs. This overhead is mostly due to the problem discussed in Section 3.1.4, where the module used to group primitive LUT RAMs to form a larger RAM is not found on the scan instrumenter’s list of commonly used modules. As such, the scan instrumenter has no way of knowing which LUT RAMs share common addressing logic, so each LUT RAM incurs the full penalty of K LUTs when instrumented D D for scan. The price paid is a E&NG LUT increase, or a E&I'M LE increase. 
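A rough aggregation of these per-element costs gives a feel for how a design's scan overhead scales with its mix of flip-flops, LUT RAMs, and BlockRAMs. The per-element constants in the sketch below are assumptions chosen in the spirit of the analysis above (one multiplexor LUT per flip-flop, a handful of LUTs per 16x1 LUT RAM, and a BlockRAM cost in the range shown in Figure 4.12); they are not the exact figures used to produce Table 4.6.

    // DesignScanEstimate.java -- back-of-envelope design-level scan overhead.
    // All per-element constants are assumed values for illustration only.
    public class DesignScanEstimate {
        static final int LUTS_PER_FF        = 1;    // scan multiplexor, shared enable OR assumed
        static final int LUTS_PER_LUT_RAM   = 6;    // data/address muxes plus write-enable OR (assumed)
        static final int LUTS_PER_BLOCK_RAM = 120;  // mid-range of the curves in Figure 4.12 (assumed)
        static final int SHARED_CONTROL     = 10;   // address generator and other one-time logic (assumed)

        static int extraLuts(int flipFlops, int lutRams, int blockRams) {
            return flipFlops * LUTS_PER_FF
                 + lutRams * LUTS_PER_LUT_RAM
                 + blockRams * LUTS_PER_BLOCK_RAM
                 + SHARED_CONTROL;
        }

        public static void main(String[] args) {
            // Element counts from Table 4.5; compare against the LUT increases in Table 4.6.
            System.out.println("CDI : ~" + extraLuts(4478, 40, 18) + " extra LUTs");
        }
    }

For CDI this predicts roughly 6900 extra LUTs, in the neighborhood of the increase visible in Table 4.6; for designs where RAM scan logic is heavily shared (or, as with Superquant's LUT RAMs, not shared at all) the constants would have to be tuned accordingly.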
In Table 4.6 the speed penalty incurred by scan for these designs is provided. For D Eigenray BF and CDI the average circuit speed after instrumenting scan is only N'L of the original speed. For Low-power BF and Superquant, no speed data is given for scan. This is because instrumenting scan actually caused these circuits to be too large to fit the FPGA! 43 This is clearly a problem; some possible solutions include using a larger FPGA for debug (provided it has the same pin out as the smaller FPGA), implementing better optimization techniques such as the first of the resource sharing techniques described in Section 3.1.4, or using partial scan techniques. Partial scan was described in Section 4.1.2 and estimated results were given for the multiplier and CORDIC unit. To give an example of how partial scan might help for a larger circuit, consider the Eigenray BF case, which contains, among other logic, 2 CORDIC units and 5 array multipliers of varying sizes. To estimate the cost of using partial scan, it is assumed that the circuit is instrumented for full scan with the exception of the multipliers and CORDICs, which are partially scanned as described in Section 4.1.2. By actually instrumenting the Eigenray BF with full scan without instrumenting the CORDICs and multipliers, and then adding to the resulting LE count the estimated partial scan costs of the CORDICs and multipliers, it is estimated that a partially scanned Eigenray BF would cost C D G>_`G K LEs, which is a IPL overhead over the original non-scan design—an improvement CD over the G overhead required for full scan. Though not conclusive, these results are encouraging and suggest that partially scanning library elements can reduce scan overhead while giving up little visibility. To sum up these results, it is clear that LE counts are a more accurate method of determining the actual area overhead for instrumenting scan than LUT counts because many LUTs added by scan can be packed into partially filled LEs. However, LUT counts can be useful to get an idea as to how much logic was actually instrumented for scan. Also, RAMs are extremely costly to instrument for scan—a dozen or more BlockRAMs can literally require thousands of extra LUTs to instrument; LUT RAMs are also very expensive, especially when they are not optimized for scan and pay the full penalty of scan. In addition, scan slows down the speed of the circuit when running in normal operation. This can be a problem for circuits required to run at or above certain frequencies. Lastly, it is clear that scan can make large circuits too big to fit on the FPGA. To get around this, either a larger FPGA needs to be utilized or some techniques must be used to decrease the overhead of scan. 44 4.1.4 Packing Scan Logic into Existing LUTs As was previously illustrated in Figure 4.9, one effect of instrumenting scan is the packing of instrumentation logic into already existing 4-LUTs used by the original circuit. If most of the additional scan logic gets packed into already existing LUTs, there won’t be much area overhead. Unfortunately, Table 4.7 shows that little of the scan logic actually gets packed into already existing LUTs. 
The table shows for a variety of Virtex designs after they have been instrumented with scan: (1) the percentage of LUTs containing the original user logic (including user LUTs that have been packed with scan logic), (2) the percentage of LUTs containing only scan circuitry—i.e., LUTs purely containing scan overhead, and (3) the percentage of the total scan logic that was packed into already existing LUTs— that is, scan logic that does not affect the area of the circuit in any way. As can be seen from the table, some designs, such as Cnt and Cordic, do not have any scan logic packed into existing LUTs. This means that these designs pay the full LUT overhead for scan. D of its scan logic packed into already existing LUTs, CDI, on the other hand, has over EG which reduces the actual scan overhead for the design. The average for these designs is to D of the scan logic packed into already existing LUTs. Although this number have only J does not take into account the scan LUTs that do not take up any extra area since they are packed into partially filled LEs, this table shows that user designs come close to paying the full overhead for scan, with very little of the scan logic being packed into already existing 4-LUTs. 4.1.5 Using Dedicated Scan Multiplexors Due to the heavy use of flip-flops in many designs, much of the overhead from instrumenting scan chains comes from adding the scan multiplexors to the flip-flops, as described in Section 3.1.1. One way to reduce this overhead would be for the FPGA vendor to provide dedicated scan multiplexors for each flip-flop. For example, in the Xilinx Virtex technology, some dedicated multiplexors exist, such as the library primitives muxf5, muxf6, and muxcy. These multiplexors add additional logic to a Virtex slice or CLB beyond that provided by the 4-LUTs: the muxcy is used for carry logic and the muxf5 and muxf6 are used to provide additional logic, such as 4:1 and 8:1 multiplexors, without consuming extra LUT 45 Design Cnt Mult Cordic Low-power BF CDI Superquant averages Table 4.7: Logic Packing for Sample Designs % LUTs Containing % LUTs Containing % of Scan Logic User Logic Only Scan Logic Packed Into User LUTs 44.4% 55.6% 0.00% 29.8% 70.2% 5.88% 48.4% 51.6% 0.00% 59.5% 40.5% 1.75% 44.6% 55.4% 13.22% 34.9% 65.1% 9.60% 43.6% 56.4% 5.08% resources. If Xilinx and other FPGA vendors added a dedicated primitive such as muxscan, this multiplexor could be used instead of 4-LUTs to provide the scan muxes for FFs and RAMs without consuming valuable LUT resources. Table 4.8 estimates that the LUT and LE overheads would be significantly reduced by using a dedicated scan multiplexor. Since such a multiplexor does not exist, the results in the right-hand column are based on instrumenting each design with full scan, with the exception that it does not include the scan multiplexor with each FF. The assumption is that a dedicated scan multiplexor will be used instead, which will not use up any extra LUTs to instrument the multiplexor. It D D shows that the average LUT overhead associated with scan is reduced from EEL to KJ , B D BD to J . Note that this optimization works best and the LE overhead is reduced from I in designs that use many FFs; for designs with relatively few FFs, such as Low-power BF, the difference is minimal. Another similar approach is to use already existing dedicated multiplexors to instrument scan. The Xilinx Virtex primitives muxf5 and muxf6 can potentially be used for this purpose. 
The key is to use one of these primitives as the scan multiplexor if it is unused in the particular slice or CLB by the original design; otherwise, a regular 4-LUT must be used. This will potentially reduce the number of 4-LUTs consumed in the process of instrumenting scan. However, this implementation does have several drawbacks. First, it is very difficult to implement—since scan instrumentation logic is added prior to netlisting, there is no way of knowing which slices and CLBs the muxf5s and muxf6s will mapped to. 46 Table 4.8: Design-Level Scan Costs Using a Dedicated Scan Mux Design Normal Full Scan Full Scan w/ scan-mux LUT LE LUT LUT LE LE LUT LUT LE LE Count Count Count Ratio Count Ratio Count Ratio Count Ratio Eigenray BF 1775 2658 3413 1.92 3445 1.30 1997 1.13 2855 1.07 Low-power BF 14559 14719 24245 1.67 24391 1.66 23722 1.63 24173 1.64 CDI 5738 6675 12812 2.23 13434 2.01 8704 1.52 10417 1.56 Superquant 11806 14087 32192 2.73 32192 2.29 27423 2.32 28642 2.03 averages 2.14 1.82 1.65 1.58 To illustrate, if a muxf5 is used as the scan multiplexor of a FF in a slice where the muxf5 is already in use, the muxf5 used for scan will have to be placed in an entirely different slice, which adds an entire additional slice to the overhead of scan instead of only an additional 4-LUT for the scan multiplexor. Second, since a slice can have two FFs but only one muxf5, if both FFs in the slice are being used, the muxf5 can only provide the scan logic for one of the FFs. Third, the muxf6 and muxcy have limited access—the muxf6 can only use the outputs of a muxf5 as its inputs, and the muxcy can essentially only be used as carry logic. If the FPGA vendor made these multiplexors more accessible for scan use, they could be used as scan multiplexors in place of 4-LUTs. Naturally, some tradeoffs exist for implementing either of these two approaches. For instance, additional silicon area would be required, whether adding dedicated scan multiplexors to each slice or improving accessibility to existing dedicated muxes. However, a dedicated scan multiplexor requires far fewer transistors than instancing an entire 4-input LUT for scan. Also, the dedicated multiplexor can be optimized in silicon to increase the speed of the circuit when instrumented with a scan chain. However, if the particular design does not use many flip-flops or if the dedicated scan multiplexors are infrequently used, the overall extra silicon overhead on the chip may not be worth the cost. In addition, having a dedicated multiplexor in silicon may cause normal designs that don’t use scan to run at reduced speeds. 47 4.1.6 Other Cost Issues As mentioned earlier in the chapter, full scan for VLSI has reported overheads on the order of 5–30%. The costs for implementing full scan for FPGAs as reported in this chapter are far greater than this. An obvious question, then, is why scan costs so much more for FPGAs than it does in VLSI? The solution lies in the granularity of the devices used for implementing scan logic—transistor logic costs much less than FPGA LUT logic [13]. CD For example, [10] claims that a D flip-flop instrumented for scan is only E larger in area. In an FPGA design, however, a FF uses half of an LE. Thus, instrumenting a FF for scan effectively doubles its size, since the scan multiplexor for the FF requires a LUT, which is also half an LE. The size may be tripled or even quadrupled by using additional LUTs for the clock enable and set/reset scan logic. 
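The doubling-to-quadrupling claim can be read directly off LE-level accounting; the snippet below simply spells out that arithmetic (a worked restatement of the reasoning above, not new data), using the convention that a flip-flop by itself occupies half an LE and each LUT of scan logic occupies the other half of some LE.

    // PerFlipFlopScanArea.java -- slice-level arithmetic behind the claim above.
    public class PerFlipFlopScanArea {
        // Area, in LEs, of one scanned flip-flop plus its scan LUTs.
        static double scannedFfLes(int extraScanLuts) {
            double flipFlop = 0.5;                  // the flip-flop itself is half an LE
            return flipFlop + 0.5 * extraScanLuts;  // each scan LUT is another half LE
        }

        public static void main(String[] args) {
            System.out.println("scan mux only             : " + scannedFfLes(1) / 0.5 + "x the original area");
            System.out.println("mux + clock-enable OR     : " + scannedFfLes(2) / 0.5 + "x");
            System.out.println("mux + enable + set/reset  : " + scannedFfLes(3) / 0.5 + "x");
        }
    }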
Section 6.1.2 will discuss scan overheads in FPGAs versus VLSI a little further. This chapter has suggested a few techniques such as partial scan and using special dedicated scan multiplexors to help lower the cost of scan in FPGA devices. Chapter 5 will also discuss some ideas to help alleviate these costs for the purpose of increasing the observability and controllability of user designs. Another potential solution is to have the FPGA vendors provide scan-specific primitives. For example, if a FF design primitive existed that has extra inputs for ScanEnable, ScanIn, and ScanOut, when the user design is instrumented for scan, each FF would be replaced by one of the scan FF primitive that is already optimized for scan. That way, no extra LUTs would be used to implement scan, and the overhead would be a little bit of transistor logic used to create the primitive that would have similar overheads for scan as those shown in VLSI. 48 Chapter 5 Variations of and Alternatives to Scan 5.1 Supplementing Existing Observability and Controllability Thus far, full scan has been proposed as a method for providing full observability and controllability to provide complete functional verification on all types of FPGAs. Full scan is often necessary for providing this capability since many FPGA vendors, such as Altera and Cypress, have neither built-in observability nor controllability features. However, many FPGAs, such as those produced by Xilinx, Lucent, and Atmel, are equipped with limited capability to read or modify the state of a circuit. For example, Xilinx XC4000 and Virtex FPGAs can partially reconfigure the FPGA at run-time, thus providing controllability for the state of the embedded RAMs. However, although the state of embedded RAMs is controllable, the state of the flip-flops can only be controlled when the Global Set/Reset (GSR) is asserted at the beginning of hardware execution; after that, controlling the state of the flip-flops is impossible. In addition, some FPGAs have the ability to capture the state of the circuit through readback, as explained in Section 2.1.3, but this feature is also limited. For example, the state of the output registers on the synchronous BlockRAMs in Xilinx Virtex FPGAs is modified during readback, which invalidates the state of the circuit. One method of achieving full circuit observability and controllability without paying the high overhead of full scan is to take advantage of already existing observable and controllable features of FPGAs. Scan and other related techniques can then be used to overcome the shortcomings of those features. This chapter discusses variations and alternatives to full scan that supplement existing FPGA debug features to provide complete controllability and observability of the user 49 circuit without paying the full cost of full scan. It uses circuit designs on Xilinx Virtex and XC4000 FPGAs to illustrate. Although this chapter generally addresses controllability issues separately from observability issues, these issues are often related. For example, using partial reconfiguration to control the circuit state requires the FPGA to be reconfigured on a frame-by-frame basis. Thus, modifying any portion of the state of an FPGA requires the entire frame to be read and modified. The sequence of events to modify the FPGA state is to (1) read the state of the frame being modified, (2) modify the desired portion of the frame, and then (3) write the frame back into the bitstream. 
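A sketch of that frame-level read-modify-write sequence follows. The FrameAccess interface, frame addressing, and bit offsets are hypothetical placeholders standing in for whatever configuration interface a particular board exposes (SelectMAP, JTAG, etc.); the point is only the ordering of the three steps.

    // FrameReadModifyWrite.java -- the three-step sequence described above, with a
    // hypothetical FrameAccess interface standing in for the real configuration port.
    public class FrameReadModifyWrite {

        // Hypothetical access to one configuration frame; not a real vendor API.
        interface FrameAccess {
            byte[] readFrame(int frameAddress);                // step 1
            void   writeFrame(int frameAddress, byte[] data);  // step 3
        }

        // Set or clear a single state bit within a frame, leaving every other
        // bit of that frame exactly as it was read back.
        static void modifyBit(FrameAccess cfg, int frameAddress, int bitOffset, boolean value) {
            byte[] frame = cfg.readFrame(frameAddress);        // (1) read the whole frame
            int byteIndex = bitOffset / 8;
            int mask = 1 << (bitOffset % 8);
            if (value) {                                       // (2) modify only the desired bit
                frame[byteIndex] |= mask;
            } else {
                frame[byteIndex] &= ~mask;
            }
            cfg.writeFrame(frameAddress, frame);               // (3) write the frame back
        }
    }

A concrete FrameAccess implementation would wrap the board's configuration port; nothing can be written without first reading the enclosing frame, which is exactly the dependence noted next.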
Hence, circuit controllability is dependent on circuit observability. 5.1.1 Strategies to Increase Controllability One of the purposes of scan is to provide the ability to bring the circuit to a known state during debug. Some FPGA vendors, such as Xilinx, already have the ability to externally modify the state of their embedded RAMs—the LUT RAMs and BlockRAMs— through partial reconfiguration. However, the state of the flip-flops cannot be modified externally; since FFs are widely used in many designs, this makes the controllability of Xilinx FPGAs very limited. One option to provide complete controllability of Xilinx FPGAs is to use the builtin partial-reconfiguration features to control the state of the LUT RAMs and BlockRAMs, and to use scan to control just the flip-flops. The area overhead for this method consists of the cost of instrumenting the flip-flops for scan and a minimal amount of extra logic required to disable all other memories to preserve their state during scan. Table 5.2 compares the overhead of this approach to the overhead of full scan for the same large JHDL designs mentioned in Chapter 4. Table 5.1 contains the original flip-flop count, BlockRAM count, LUT RAM count, LUT count, and LE count for these designs. The numbers in Table 5.1 can also be found in Table 4.5 from the previous chapter, and have been duplicated here for convenience to aid in the discussion. As can be seen in Table 5.2, the reduction in LUT overhead for Eigenray BF when only the FFs are instrumented for scan when compared with full scan is actually very D B D LUT overhead down to an K LUT overhead. The LE overhead small, going from a MI 50 Design Eigenray BF Low-power BF CDI Superquant Table 5.1: Area of Sample Designs Original FF BlockRAM LUT RAM Total LUT Count Count Count Count 2216 0 67 1775 738 30 1935 14559 4478 18 40 5738 4890 0 3658 11806 LE Speed Count (MHz) 2658 14.35 14719 3.70 6675 31.28 14087 29.46 Table 5.2: Design-Level Scan on Only Flip-Flops Design Full Scan Scan only flip-flops LUT LUT LE LE LUT LUT LE LE Count Ratio Count Ratio Count Ratio Count Ratio Eigenray BF 3413 1.92 3445 1.30 3306 1.86 3427 1.29 Low-power BF 24245 1.67 24391 1.66 16035 1.10 16035 1.09 CDI 12812 2.23 13434 2.01 10945 1.91 10945 1.64 Superquant 32192 2.73 32192 2.29 20342 1.72 20342 1.44 averages 2.14 1.82 1.65 1.37 also went down only slightly from G CD to IM D . The reason for the small improvement is because most of the memory elements in the design are FFs; there are relatively few RAMs in the design, as is shown in Table 5.1. Thus, instrumenting only the FFs for scan in this design results in almost the same circuitry as instrumenting the design for full scan, so the overheads are about the same. KL D C D CDI tells a similar story, with the LE overhead of scan going from E E down to . This design is also dominated by FFs rather than embedded RAMs, so much of the design must be instrumented for scan. This is still a greater reduction in overhead than for the Eigenray BF case, though, due to CDI’s heavy use of BlockRAMs. Chapter 4 showed the costs of instrumenting BlockRAMs for scan to be extremely high, so not including them in the scan chain achieves a significant improvement in area overhead. Low-power BF and Superquant both have the greatest savings in overhead when D scanning only the FFs, with the LE overhead for Low-power BF going from KK down to D D D down to LL . 
Both of these M , and the LE overhead for Superquant going from EIM 51 designs have a large number of embedded RAMs, so not instrumenting these RAMs for scan leads to a great savings in area overhead. In addition, since Low-power BF does not use very many FFs in its design, the new overhead for the design is relatively small, D indeed—only M . By instrumenting only the FFs for scan in these designs, the average LE overhead B D D to GN —a significant savings. The price of the three pins being was reduced from I used for scan control is still paid, but this partial scan approach is certainly a far more reasonable solution than performing a full scan. When vendors fail to provide complete circuit controllability, scan may be the only method available for controlling the state of all the memory elements. 5.1.2 Strategies to Increase Observability In addition to providing some controllability features, Xilinx FPGAs provide means of observing the circuit state through readback, as discussed in Chaper 2. One of the problems with readback is that the state of the output registers in Virtex BlockRAMs is altered whenever a readback is performed, thus altering the circuit state. One solution to this problem is to use scan to read the state of the BlockRAMs, and to use readback to read the state of all the other memory elements in the design. Some extra logic will be necessary to disable these other memory elements while scan is being performed on the BlockRAMs to preserve their state. Unfortunately, the problem with this method is the high LUT and FF overhead incurred by scan for BlockRAMs, as shown previously in Figures 4.12 and BC C FFs are needed 4.13 in Chapter 4. These figures show that as many as E&J LUTs and to instrument each BlockRAM in the design for scan. Hence, an alternative solution is definitely preferrable. An alternate approach to this BlockRAM readback problem is to add shadow registers with control logic to the output registers of the BlockRAMs to save their state during a readback, as shown in Figure 5.1. To gain an understanding of how this extra logic works, consider the normal sequence of events when a readback is performed: (1) the clock edge arrives, (2) the circuit settles, (3) readback is performed. However, since the state of the 52 BlockRAM Din Enable Dout n n F2 Enable Enable F3 D Q ClkEn D Q n 0 n n Dout 1 Pre PrepRB Addr F1 D Q Clr PrepRB Figure 5.1: Logic to Fix Readback Problem with Virtex BlockRAMs BlockRAM output registers is modified during a readback, it will propagate incorrect values to the rest of the circuit as soon as the next clock edge arrives. Using the logic shown in Figure 5.1, here is the new sequence of events: (1) the clock edge arrives, (2) the circuit settles, (3) the control signal PrepRB goes high and then immediately low again, acting like the rising and falling edge of a clock for the bank of flip-flops labeled F3. When this occurs, F3 now contains a copy of the current state of the BlockRAM output registers. In addition, the flip-flop labeled F1 is asynchronously reset by PrepRB so that the multiplexor selects F3 as the Dout output during that cycle. Finally, (4) readback is performed, but since the Dout seen by the rest of the circuit is really the contents of F3, it doesn’t matter that the contents of the BlockRAM output registers have been modified. This process repeats for each cycle that a readback is performed. A few more details about this circuit are in order. 
During normal BlockRAM operation, the output registers are refreshed each clock cycle. Thus, F3 always gets valid data even if a readback was performed the previous cycle. However, whenever the BlockRAM is disabled (the enable signal is low), the output of the BlockRAM does not change. This means that after a readback is performed, the output register isn’t refreshed with valid data until the enable goes high again. So in Figure 5.1, the flip-flop F2 ensures that F3 is only loaded when the BlockRAM presents valid data on its output registers, and F1 always selects F3 as the Dout signal whenever the Dout from the BlockRAM output registers is 53 invalid. As can be seen in the figure, the overhead of instrumenting the shadow registers for a BlockRAM with an output of size c is I flip-flops and I 4-LUTs for control, plus c flip-flops and c 4-LUTs for the shadow registers for a total overhead of cRQdI flip-flops and c$QVI 4-LUTs for each port of the BlockRAM. However, if the Enable is shared by different BlockRAMs or even by both ports of a dual-ported BlockRAM, the control logic may be shared, thus resulting in a cost of c flip-flops and 4-LUTs for each additional BlockRAM or BlockRAM port with a common Enable. Table 5.3: Cost of Repairing BlockRAM Readback Design Normal With BlockRAM Readback Logic FF LUT LE FF FF LUT LUT LE LE Count Count Count Count Ratio Count Ratio Count Ratio Low-power BF 738 14559 14719 1078 1.46 14809 1.01 15231 1.03 CDI 4478 5738 6675 4880 1.09 6065 1.06 7368 1.10 averages 1.28 1.04 1.07 Table 5.3 shows the cost of instrumenting this readback logic for Low-power BF C and CDI. Low-power BF contains G 4-bit wide dual-ported BlockRAMs, so the worst-case overhead of adding this readback logic for each BlockRAM would be L FFs and 4-LUTs for B the control and FFs and 4-LUTs for the shadow registers (since the BlockRAMs are dualported, they require twice as much logic than the single-ported BlockRAMs). Multiply this C C number by G and the result is a worst-case overhead of GK FFs and 4-LUTs for Lowpower BF. In reality, the overhead is less than this for two reasons: (1) some of the control logic can be shared due to common enables and (2) some of the BlockRAM outputs are loadless, so some of this extra logic gets optimized away by the vendor’s back-end tools. D D for LUTs and G for LEs. The FF overhead The result is an overhead of less than E D seems high at LK due to the relatively few FFs used in the original design. 54 CDI can be analyzed similarly since it contains E B 16-bit wide dual-ported Block- RAMs. This leads to a worst case overhead of L FFs and 4-LUTs for the control logic, GI B FFs and 4-LUTs for the shadow registers, multiplied by E BlockRAMs for a worst-case B overhead of KL FFs and 4-LUTs. Again, some of the control logic is shared and about half of the BlockRAM’s outputs are loadless, so the actual overhead is much less than this. The D CD D and E , respectively, while the FF overhead is M . LUT and LE overhead for CDI is K These overheads are much smaller than those reported for full-scan; in addition, this method only uses one additional pin for the PrepRB signal, as opposed to four additional pins for performing full scan on BlockRAMs. Circuit designs can be instrumented with this readback logic in the same manner as they were instrumented for scan, as described in Section 3.1.3. From these examples we see that scan is fully capable of solving observability and controllability issues when no other alternatives exist. 
However, when cheaper alternatives can be found, they should be used instead of scan. 5.2 Summary of Results Chapter 4 enumerated the costs of implementing full scan to provide complete observability and controllability of user designs in FPGAs. This chapter has shown how it is possible to apply techniques related to scan to supplement existing FPGA debug capabilities to provide complete observability and controllability at a lower cost. The tables that follow show a summary of these results. Table 5.4 provides a summary of the original area costs for all of the JHDL designs mentioned in Chapter 4. These designs include the three library modules: cnt, mult, and cordic, as well as the four large JHDL designs: Eigenray BF, Low-power BF, CDI, and Superquant. Table 5.5 shows the overhead of instrumenting these designs will full-scan to provide complete observability and controllability of the designs. The methodology of instrumenting these designs was discussed in Chapter 4. As can be seen from the table, the B D average LE area overhead for instrumenting these designs is L , which is nearly double the size of the original designs. 55 Table 5.4: Original Area of User Designs Design FPGA Type Original FF LUT Count Count cnt Virtex 4 4 mult Virtex 615 270 cordic Virtex 768 780 Eigenray BF XC4000 2216 1775 Low-power BF Virtex 738 14559 CDI Virtex 4478 5738 Superquant Virtex 4890 11806 LE Count 4 630 812 2658 14719 6675 14087 Table 5.5: Area of User Designs w/ Full-Scan Design Full Scan FF FF LUT LUT LE Count Ratio Count Ratio Count cnt 4 1.00 9 2.25 9 mult 615 1.00 871 3.23 871 cordic 768 1.00 1596 2.05 1596 Eigenray BF 2222 1.00 3413 1.92 3445 Low-power BF 1307 1.77 24245 1.67 24391 CDI 5455 1.22 12812 2.23 13434 Superquant 4896 1.00 32192 2.73 32192 averages 1.14 2.30 56 LE Ratio 2.25 1.38 1.97 1.30 1.66 2.01 2.29 1.84 Table 5.6: Best-Case Results for Improving Observability Design Using BlockRAM Readback Logic FF FF LUT LUT LE LE Count Ratio Count Ratio Count Ratio cnt 4 1.00 4 1.00 4 1.00 mult 615 1.00 270 1.00 630 1.00 cordic 768 1.00 780 1.00 812 1.00 Eigenray BF 2216 1.00 1775 1.00 2658 1.00 Low-power BF 1078 1.46 14809 1.01 15231 1.03 CDI 4880 1.09 6065 1.06 7368 1.10 Superquant 4890 1.00 11806 1.00 14087 1.00 averages 1.08 1.01 1.02 Table 5.7: Best-Case Results for Improving Controllability Design Scanning Only FFs FF FF LUT LUT LE LE Count Ratio Count Ratio Count Ratio cnt 4 1.00 9 2.25 9 2.25 mult 615 1.00 871 3.23 871 1.38 cordic 768 1.00 1596 2.05 1596 1.97 Eigenray BF 2216 1.00 3306 1.86 3427 1.29 Low-power BF 738 1.00 16035 1.10 16035 1.09 CDI 4478 1.00 10945 1.91 10945 1.64 Superquant 4890 1.00 20342 1.72 20342 1.44 averages 1.00 2.02 1.58 Table 5.8: Best-Case Results for Improving Observability and Controllability Design BlockRAM Readback Logic/Scanning Only FFs FF FF LUT LUT LE LE Count Ratio Count Ratio Count Ratio cnt 4 1.00 9 2.25 9 2.25 mult 615 1.00 871 3.23 871 1.38 cordic 768 1.00 1596 2.05 1596 1.97 Eigenray BF 2216 1.00 3306 1.86 3427 1.29 Low-power BF 1078 1.46 16362 1.12 16584 1.13 CDI 4880 1.09 11371 1.98 11679 1.75 Superquant 4890 1.00 20342 1.72 20342 1.44 averages 1.08 2.03 1.60 57 Sometimes, circuit designers are not nearly as interested in being able to modify their designs as they are in simply being able to completely observe the circuit state. Thus, Table 5.6 shows the best results in overhead that can be achieved in Xilinx Virtex and XC4000 FPGAs when using readback to only observe the circuit state. 
Since readback’s limitation to observing circuit state is its modification of the BlockRAM output registers during the readback process, the table reflects the cost of instrumenting the extra readback logic discussed in Section 5.1.2. Four of the designs are Virtex designs that do not use BlockRAMs, and one of the designs is an XC4000 design, which does not support the use of BlockRAMs; these designs do not incur any area overhead by using readback to obtain the circuit state. Only two designs shown in the table contained BlockRAMs: Low-power D CD and E LE overhead for instrumenting the extra BF and CDI. These designs carry a G readback logic, respectively. Table 5.7 shows the best-case overhead that can be accomplished when the user desires controllability, but not observability, of the circuit design. These results were obtained based on the premise described in Section 5.1.1 that partial reconfiguration can be used to modify the state of embedded memories, but scan must be used to obtain the state of the FFs in the design. Hence, the overhead in this table is the overhead of instrumenting only the FFs in the design for scan, while adding logic to disable all other memories during scan to preserve their state. Some of these designs contain only FFs with no embedded memories; in such cases, the overheads are the same as for full-scan since all the FFs must be scanned to control their state. For the rest of the designs, the overhead decreases substantially since the embedded memories in the designs do not require the full scan logic. On average, the BD cost of providing complete controllability for these designs is a J LE overhead Finally, Table 5.8 contains the results of combining the extra BlockRAM readback logic with scanning only user FFs to provide complete observability and controllability of CD to instrument these designs. Alall the designs. It shows an average LE overhead of K B D though this is certainly an improvement over the L average overhead of full scan for the same designs, it shows that since FPGA vendors currently do not provide full observability and controllability features on their FPGAs, the cost of attaining such capabilities for debug is very high. 58 Chapter 6 Other Scan Issues 6.1 Overview The chapter ties up some other loose ends pertaining to scan. It begins by discussing some of the FPGA system-level issues associated with scan. For instance, after scan has been performed on a user design, not only must the circuit state be left unaltered, but the state of the external memories, FIFOs, etc. must also remain unaltered by scan. In addition, any reads and writes being performed to external memories must also be unaffected by scan. Next comes a discussion as to why area costs in FPGAs is so high as compared with VLSI, and what kind of area improvements can be made by FPGA-vendors adding built-in scan functionality into their FPGAs. Lastly, this chapter shows how logic can be implemented in a similar manner to scan to stop the user clock during readback. 6.1.1 FPGA System-Level Issues When implementing design-level scan in FPGAs, care must be taken so that the user design is not modified by the scan process. For example, while the state bits are being scanned out and back into the circuit, both combinational and synchronous values are switching all the time. If one of these signals drives the reset of a FF and goes high during the scan process, the FF is then reset and the circuit is unintentionally modified by scan. 
For user designs, this problem was solved by using an AND gate to disable all sets, resets, and other such signals, as was shown in Figure 3.3. Unfortunately, this solution only works to prevent the user design contained on the FPGA itself from being modified. Frequently, signals contained in the FPGA design 59 control reads and writes to external memories on an FPGA system. This could result in undesired writes to the external memories when the circuit is operating in scan mode. An easily implemented solution, then, is to tri-state the I/O pins connected to the write enables on the external memories during scan. Since the memory write enables are active low, these same I/O pins must also be connected to weak pull-ups to disable writing to the external memories during scan. Another system-level issue involves handling reads and writes to external memories that have begun, but have not yet completed when scan first begins. An easy solution is to buffer the data being read so that it can be used after scan, and to buffer the data being written to ensure the correct state is still written to the memories. 6.1.2 Scan Overhead in FPGAs vs. VLSI As was mentioned in Section 4.1.6, various reasons exist as to why scan is so much more expensive in FPGAs than it is in VLSI. One of the main area considerations is the size of a 4-input LUT versus the size of a standard D flip-flop with clock enable and set/reset. According to [14], the transistor area of such a FF is about 18 transistors, whereas the area for a 4-LUT is about 167 transistors. If the logic required to scan a FF is a multiplexor and two logic gates, as shown previously in Figure 3.3, this requires three LUTs of overhead, or approximately 501 transistors! The same scan logic could be implemented in VLSI for approximately 16 transistors. Thus, one approach that FPGA-vendors can take is to add built-in scan logic to each of the flip-flops on an FPGA. Although this would effectively double the area of C each flip-flop, this is nothing compared to the G X cost associated with using LUT-logic to provide scan capabilities. Another approach is to only provide a special scan-multiplexor, as described in Section 4.1.5. This would cost only an extra L – K transistors more per flipflop, but it could significantly reduce the cost of instrumenting scan. Naturally, another area overhead includes the extra routing required for scan, but obtaining such numbers is beyond the scope of this work. 60 6.1.3 Stopping the Global Clock Readback has been shown as one method to obtain the state of a user circuit. How- ever, as was discussed in Section 2.1.3, readback requires the ability of the FPGA system to stop the user clock. Unfortunately, this is not possible in many FPGA-based systems. Using scan to retrieve the circuit state avoided this limitation since the clock is still running when operating in scan mode, although the circuit is not doing any useful work during those clock cycles. The main disadvantage to scan, though, is the large area overhead it incurs. 
6.1.3 Stopping the Global Clock

Readback has been shown as one method to obtain the state of a user circuit. However, as was discussed in Section 2.1.3, readback requires the ability of the FPGA system to stop the user clock. Unfortunately, this is not possible in many FPGA-based systems. Using scan to retrieve the circuit state avoids this limitation since the clock is still running when operating in scan mode, although the circuit does no useful work during those clock cycles. The main disadvantage to scan, though, is the large area overhead it incurs.

An approach similar to instrumenting scan logic into user circuitry can be used to effectively stop the user clock during a readback. Instead of using a ScanEnable signal, this circuitry uses a Global Clock Enable (GCE) signal. Adding the GCE logic consists of placing an AND gate in front of the clock enable of each flip-flop and in front of the chip enable or write enable of each embedded RAM, as shown in Figure 6.1. Whenever GCE is pulled low, all the FFs and embedded RAMs are disabled. If a flip-flop does not have a clock enable, it must either be replaced by an equivalent FF that does, or else it must be instrumented as shown in Figure 6.2. As seen in the figure, such FFs use a multiplexor in front of the D input so that when GCE is low, the FF does not update with new data.

[Figure 6.1: Global Clock Enable Circuitry for Flip-Flops]

[Figure 6.2: Global Clock Enable Circuitry for Flip-Flops without Clock Enables]

Table 6.1: Area Overheads for Clock-Stopping Circuitry

    Design        |    Normal           |      AND gate approach            |        MUX approach
                  | LUT Count  LE Count | LUT Count  Ratio  LE Count  Ratio | LUT Count  Ratio  LE Count  Ratio
    Eigenray BF   |     1775      2658  |     1835    1.03     2694    1.01 |     3345    1.88     3371    1.27
    Low-power BF  |    14559     14719  |    15524    1.07    15673    1.06 |    16104    1.11    16104    1.09
    CDI           |     5738      6675  |     6572    1.15     7545    1.13 |    10426    1.82    10426    1.56
    Superquant    |    11806     14087  |    21672    1.84    21672    1.54 |    20065    1.70    20065    1.42
    Averages      |                     |             1.27             1.19 |             1.63             1.34

Table 6.1 shows the overhead associated with implementing the AND gate and the multiplexor approaches to stop the global clock. As can be seen from the table, if all of the FFs in the design have clock enables, the AND gate approach may be used, which leads to an average overhead of about 27% more LUTs, or 19% more LEs. However, if the FFs do not have clock enables, the overhead is about 63% in terms of LUTs and 34% in terms of LEs. The reason the overhead is much greater for the multiplexor case is that the GCE multiplexor cannot be shared by multiple FFs, whereas the AND gate can be shared among FFs that use a common enable. Also, the multiplexor logic can require one or more additional LUTs to route the output of the flip-flop back to the input of the multiplexor, which adds further overhead.
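The following Python model is a behavioral illustration of Figures 6.1 and 6.2; the class and signal names are invented for this sketch, and it is not an HDL description of the actual circuitry.

```python
# Behavioral model of the clock-stopping circuitry: pulling GCE low prevents
# every flip-flop from updating, so readback sees a frozen, consistent state.

class FFWithClockEnable:
    """Figure 6.1: the existing clock enable is ANDed with the Global Clock
    Enable (GCE); the same gating applies to embedded RAM enables."""
    def __init__(self) -> None:
        self.q = 0
    def clock_edge(self, d: int, clk_en: int, gce: int) -> int:
        if clk_en and gce:              # AND gate in front of the clock enable
            self.q = d
        return self.q

class FFWithoutClockEnable:
    """Figure 6.2: a 2:1 multiplexor in front of D feeds the flip-flop's own
    output back when GCE is low, so the stored value cannot change."""
    def __init__(self) -> None:
        self.q = 0
    def clock_edge(self, d: int, gce: int) -> int:
        self.q = d if gce else self.q   # mux selected by GCE
        return self.q

ff = FFWithoutClockEnable()
ff.clock_edge(d=1, gce=1)               # normal operation: q becomes 1
assert ff.clock_edge(d=0, gce=0) == 1   # GCE low: the clock is "stopped", q holds
```

The mux version cannot share its GCE logic between flip-flops, which is why its area cost in Table 6.1 is so much higher than the AND-gate version.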
Chapter 7
Conclusions

7.1 Conclusions and Future Work

This work has described the limitations associated with providing complete observability and controllability for functional verification of FPGA-based designs, and has shown how instrumenting design-level scan overcomes these limitations. For example, some FPGA vendors provide built-in debug tools, such as Xilinx’s ChipScope and Altera’s SignalTap features, that provide visibility into the state of the circuit. However, these features not only provide limited visibility into the circuit state, but changing the signals being viewed requires multiple time-consuming runs through the vendor’s place and route tools. Configuration readback is another method for providing circuit visibility, but it also has limitations in viewing the state of the circuit. Lastly, no method currently exists at all for completely configuring the circuit to a known state.

A design-level scan methodology was proposed to provide complete observability and controllability for functional verification of FPGA-based designs. This degree of observability and controllability comes at a high cost, however; on average, it roughly doubles the size of a circuit and reduces its clock rate. When a designer has a circuit that needs validating, these costs may be justified if the designer can take advantage of fast hardware execution rather than being forced to use software simulation to validate the design, thus reducing the overall “time-to-market” for the design. In addition, design-level scan costs are temporary since the scan logic can be removed for the final “production” design. This suggests that the development and debugging environment might use a larger FPGA, while the final production design may fit on a smaller FPGA [15]. The main caveat to this approach, however, is ensuring that the larger FPGA has the same pinout as the smaller FPGA.

Clearly, the costs of using FPGA logic rather than device-level transistors to improve design verification are large. Chapter 5 proposed a few methods to supplement the observability and controllability features that already exist, which reduces these costs, but the result still falls far short of matching the cost of scan in VLSI. The best approach for providing complete observability and controllability of user circuits, then, is to modify the FPGA architectures themselves. Chapter 6 showed how using LUTs for scan consumes an order of magnitude more silicon area than a FF does. Thus, vendor-supplied instrumentation would provide much lower overheads than those seen in these experiments. In addition to flip-flops, vendor-supplied instrumentation should address embedded memories as well. It also needs to support both reading and writing user design state.

In the meantime, there are several possible extensions to this work in design-level scan for functional verification of FPGA-based designs. An obvious one is to explore other device-level instrumentation mechanisms; the proposed scan chain is just one possibility. Second, although this work described how to instrument some of the more common design elements for scan, techniques must be developed which can integrate other FPGA design primitives into scan chains. These include I/O blocks and fully-synchronous, single-ported embedded RAMs. A third possibility is to explore scanning I/O blocks and other design-level primitives through JTAG pins. Fourth, this scan methodology should be extended to work with designs that have multiple clocks or gated clocks, which are common features in today’s designs. For gated clocks, one possible solution is to OR the gated clock logic with the ScanEnable to force the clock to be enabled during scan (a minimal sketch of this idea follows this section). A solution for multiple clocks might require multiple scan chains that share the same control logic and I/O pins. Fifth, the system-level issues described in Section 6.1.1 need to be implemented and automated for actual FPGA-based systems. Finally, more work could be done to research partial scan techniques to improve functional verification at a lower cost.
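As a minimal sketch of the gated-clock suggestion above (the names are illustrative only), ORing the user's gating condition with ScanEnable guarantees that the affected flip-flops are clocked whenever state is being shifted:

```python
# Illustrative only: force a gated clock (or clock enable) active during scan.

def effective_clock_enable(user_gate_condition: bool, scan_enable: bool) -> bool:
    """OR gate: the clock is enabled either by the user's gating logic or by scan."""
    return user_gate_condition or scan_enable

assert effective_clock_enable(user_gate_condition=False, scan_enable=True) is True
```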
Bibliography

[1] T. W. Williams and K. P. Parker, “Design for testability - a survey”, IEEE Transactions on Computers, vol. C-31, no. 1, pp. 2–15, January 1982.

[2] J. M. Arnold, “The Splash 2 software environment”, in Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, D. A. Buell and K. L. Pocek, Eds., Napa, CA, April 1993, pp. 88–93.

[3] J. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati, and P. Boucard, “Programmable active memories: Reconfigurable systems come of age”, IEEE Transactions on VLSI Systems, vol. 4, no. 1, pp. 56–69, 1996.

[4] P. Graham, B. Hutchings, and B. Nelson, “Improving the FPGA design process through determining and applying logical-to-physical design mappings”, Technical Report CCL-2000-GHN-1, Brigham Young University, Provo, UT, April 2000.

[5] B. L. Hutchings and B. E. Nelson, “Unifying simulation and execution in a design environment for FPGA systems”, IEEE Transactions on VLSI Systems, to appear.

[6] P. Bellows and B. L. Hutchings, “JHDL - an HDL for reconfigurable systems”, in Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, J. M. Arnold and K. L. Pocek, Eds., Napa, CA, April 1998, pp. 175–184.

[7] B. Hutchings, P. Bellows, J. Hawkins, S. Hemmert, B. Nelson, and M. Rytting, “A CAD suite for high-performance FPGA design”, in Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, K. L. Pocek and J. M. Arnold, Eds., Napa, CA, April 1999, IEEE Computer Society.

[8] A. L. Crouch, Design for Test for Digital IC’s and Embedded Core Systems, chapter 3, p. 97, Prentice Hall PTR, Upper Saddle River, NJ, 1999.

[9] S. L. Hurst, VLSI Testing: Digital and Mixed Analogue/Digital Techniques, chapter 5, p. 218, Number 9 in IEE Circuits, Devices and Systems Series, Institution of Electrical Engineers, London, 1998.

[10] M. J. S. Smith, Application Specific Integrated Circuits, chapter 14, p. 764, Addison-Wesley, Reading, MA, 1997.

[11] S. Scalera, M. Falco, and B. Nelson, “A reconfigurable computing architecture for microsensors”, in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, K. L. Pocek and J. M. Arnold, Eds., Napa, CA, April 2000, IEEE Computer Society Press.

[12] M. Wirthlin, S. Morrison, P. Graham, and B. Bray, “Improving the performance and efficiency of an adaptive amplification operation using configurable hardware”, in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, K. L. Pocek and J. M. Arnold, Eds., Napa, CA, April 2000, IEEE Computer Society Press.

[13] A. DeHon, Reconfigurable Architectures for General-Purpose Computing, PhD thesis, Massachusetts Institute of Technology, September 1996.

[14] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, Appendix B, p. 216, The Kluwer International Series in Engineering and Computer Science, Kluwer Academic Publishers, Boston, MA, 1999.

[15] S. Trimberger, “A reprogrammable gate array and applications”, Proceedings of the IEEE, vol. 81, pp. 1030–1041, July 1993.