IMPROVING DESIGN OBSERVABILITY AND CONTROLLABILITY
FOR FUNCTIONAL VERIFICATION OF FPGA-BASED CIRCUITS
USING DESIGN-LEVEL SCAN TECHNIQUES
by
Timothy Brian Wheeler
A thesis submitted to the faculty of
Brigham Young University
in partial fulfillment of the requirements for the degree of
Master of Science
Department of Electrical and Computer Engineering
Brigham Young University
February 2001
Copyright © 2001 Timothy Brian Wheeler
All Rights Reserved
BRIGHAM YOUNG UNIVERSITY
GRADUATE COMMITTEE APPROVAL
of a thesis submitted by
Timothy Brian Wheeler
This thesis has been read by each member of the following graduate committee and by
majority vote has been found to be satisfactory.
Date
Brent E. Nelson, Chair
Date
Brad L. Hutchings
Date
Michael J. Wirthlin
BRIGHAM YOUNG UNIVERSITY
As chair of the candidate’s graduate committee, I have read the thesis of Timothy Brian
Wheeler in its final form and have found that (1) its format, citations, and bibliographical
style are consistent and acceptable and fulfill university and department style requirements;
(2) its illustrative materials including figures, tables, and charts are in place; and (3) the
final manuscript is satisfactory to the graduate committee and is ready for submission to
the university library.
Date
Brent E. Nelson
Chair, Graduate Committee
Accepted for the Department
A. Lee Swindlehurst
Graduate Coordinator
Accepted for the College
Douglas M. Chabries
Dean, College of Engineering and Technology
ABSTRACT
IMPROVING DESIGN OBSERVABILITY AND CONTROLLABILITY FOR
FUNCTIONAL VERIFICATION OF FPGA-BASED CIRCUITS USING
DESIGN-LEVEL SCAN TECHNIQUES
Timothy Brian Wheeler
Department of Electrical and Computer Engineering
Master of Science
FPGA devices have become an increasingly popular way to implement hardware
due to their flexibility and fast time to market. However, one of their major drawbacks is
their limited ability to provide complete functional verification of user designs. Readback,
partial reconfiguration, and built-in logic analyzers are all common examples of debug features currently available in many FPGAs, but none of them provide complete observability
and controllability of the user design. To take advantage of the full potential of FPGA
systems, FPGA development tools need to provide this level of observability and controllability to enable designers to quickly find and remove bugs from their circuits.
This work describes the use of design-level scan to provide complete observability
and controllability for functional verification of FPGA-based designs. An overview of
current debug methods is given. A detailed description of implementing design-level scan
is presented. The costs associated with scan are provided, together with strategies on how
to reduce those costs. This work will show that design-level scan is a viable option for
overcoming the limitations of current functional verification techniques.
ACKNOWLEDGMENTS
I want to express sincere thanks to my early engineering professors, including Dr.
Lee Swindlehurst, Dr. Rich Selfridge, Dr. Brent Nelson, and Dr. Doran Wilde, for showing
me the error of my ways as a computer science major and helping me make the transition
into the fluorescent light of computer engineering.
I would like to thank Dr. Brent Nelson and Paul Graham for providing me with a
research topic for my thesis.
I would like to thank the back row for being–well, the back row. In particular, Steve
Morrison for his eager desire to plan the lab parties, Major Greg Ahlquist for being the
highest ranking officer in the lab, Brett Williams for having a wife who makes awesome
poppyseed cake, Russell Fredrickson for sharing his paste with me when I got hungry in
nursery, Justin Tripp for giving life to Slaacenstein, Aaron Stewart for turning his desk
drawer into a kitchen, and Preston Jackson for his extreme courage (or is it endurance?) in
doing analog stuff, on a Mac, next to two smelly refrigerators.
I would like to thank the department secretaries for providing us with popcorn during the devotionals and for always being so awesome.
I would like to thank my friends who visited me and brought me food while I spent
many of my waking hours (and a few of my sleeping ones) cooped up in a lab without any
windows.
I would like to thank Bugzilla for making sure my Inbox was never empty.
I would like to thank my old textbooks for providing me with a cool monitor stand.
Finally, I would like to thank the air conditioning here in the back row for always
being such a blast and for forcing me to go home early when my fingers became too numb
to type.
Contents

Acknowledgments

List of Tables

List of Figures

1 Introduction
  1.1 Introduction

2 Current State of Debug
  2.1 Mechanisms to Increase Controllability and Observability in FPGA Designs
      2.1.1 Ad Hoc Methods
      2.1.2 Structured Methods
      2.1.3 Configuration Bitstream Readback
      2.1.4 Design-Level Scan
      2.1.5 Partial Reconfiguration
      2.1.6 Summary

3 Implementation of Scan Chain
  3.1 Design-Level Scan Implementation
      3.1.1 Instrumenting Design Primitives
      3.1.2 Storing the Scan Bitstream
      3.1.3 Instrumenting The Design Hierarchy
      3.1.4 Optimizing Scan

4 Costs of Scan Chain
  4.1 The Costs of Design-Level Scan
      4.1.1 Scan for Library Modules
      4.1.2 Partial Scan
      4.1.3 Scan for Large Designs
      4.1.4 Packing Scan Logic into Existing LUTs
      4.1.5 Using Dedicated Scan Multiplexors
      4.1.6 Other Cost Issues

5 Variations of and Alternatives to Scan
  5.1 Supplementing Existing Observability and Controllability
      5.1.1 Strategies to Increase Controllability
      5.1.2 Strategies to Increase Observability
  5.2 Summary of Results

6 Other Scan Issues
  6.1 Overview
      6.1.1 FPGA System-Level Issues
      6.1.2 Scan Overhead in FPGAs vs. VLSI
      6.1.3 Stopping the Global Clock

7 Conclusions
  7.1 Conclusions and Future Work

Bibliography
List of Tables

4.1 Design-Level Scan Costs for a Few Modules without Optimizations
4.2 Design-Level Scan Costs for a Few Modules with Optimizations
4.3 Design-Level Scan Costs for a Few Modules—LUT vs. LE Costs
4.4 Design-Level Scan Costs for a Few Modules—Full vs. Partial Scan
4.5 Area and Speed of Sample Designs
4.6 Design-Level Scan Costs for Sample Designs
4.7 Logic Packing for Sample Designs
4.8 Design-Level Scan Costs Using a Dedicated Scan Mux
5.1 Area of Sample Designs
5.2 Design-Level Scan on Only Flip-Flops
5.3 Cost of Repairing BlockRAM Readback
5.4 Original Area of User Designs
5.5 Area of User Designs w/ Full-Scan
5.6 Best-Case Results for Improving Observability
5.7 Best-Case Results for Improving Controllability
5.8 Best-Case Results for Improving Observability and Controllability
6.1 Area Overheads for Clock-Stopping Circuitry
List of Figures

3.1 Circuit View when ScanEnable is Deasserted
3.2 Circuit View when ScanEnable is Asserted
3.3 Instrumenting a Flip-Flop for Scan
3.4 Instrumenting Multiple Flip-Flops for Scan
3.5 Embedded RAMs Linked in a Scan Chain
3.6 Multi-Bit ARSW RAM Instrumented for Scan
3.7 Address Generator for RAM Instrumentation
3.8 16-deep SRL Instrumented for Scan
3.9 Sample Circuit Containing Synchronous RAM
3.10 First Attempt At Scan-Out for Sample Circuit
3.11 Corrected Scan-Out Operation
3.12 Scan-In for Sample Circuit
3.13 Synchronous RAM Instrumented for Scan
3.14 Shadow Output Register for Synchronous RAM
4.1 A 4-Bit Up-Counter
4.2 A 4-Bit Up-Counter Instrumented for Scan
4.3 Conceptual View of a 16x16 Array Multiplier
4.4 A Single Multiplier Cell
4.5 A Single Multiplier Cell Instrumented for Scan
4.6 A Fully-Pipelined Rotational CORDIC Unit
4.7 One CORDIC Stage
4.8 A Fully-Pipelined Rotational CORDIC Unit Instrumented for Scan
4.9 Folding Scan Logic Into Existing 4-LUTs
4.10 Disabling Unscanned FFs for Partial Scan
4.11 Conceptual View of Partial Scan for the Array Multiplier
4.12 LUT Scan Overhead for Instrumenting a Single Virtex BlockRAM
4.13 Flip-Flop Scan Overhead for Instrumenting a Single Virtex BlockRAM
5.1 Logic to Fix Readback Problem with Virtex BlockRAMs
6.1 Global Clock Enable Circuitry for Flip-Flops
6.2 Global Clock Enable Circuitry for Flip-Flops without Clock Enables
Chapter 1
Introduction
1.1 Introduction
FPGA devices are a popular way to implement hardware in many products. Their
reprogrammability gives them a great advantage over conventional ASICs in that it reduces
the overall design risk and time to market. For example, when designing an ASIC, most
of the work is done on the front end; that is, all of the functional design and verification,
timing analysis, etc. are done extensively in software before the design ever reaches silicon.
The simulation process in software is slow and meticulous, and the design engineer must
go to great lengths to ensure that the design will work correctly and operate at the desired
speed when implemented in actual silicon. Once the ASIC design has been extensively
simulated and debugged in software, the process of fabricating the ASIC and doing further
debug and verification of the silicon and hardware circuit is yet another lengthy and costly
step. With FPGAs, however, the hardware is available from the outset; thus, the hardware
implementation takes place much sooner, simulation in hardware is much faster than in
software, corner cases can be tested more easily, and bugs can be found and removed much
more quickly. Their reprogrammability also eliminates the lengthy fabrication step associated with ASICs since designers develop applications directly onto the FPGA hardware.
Although ASICs may be more favorable than FPGAs for extremely large and complex designs, such as a computer processor, where FPGAs simply cannot provide the necessary
speed and logic required for the design, FPGA-based systems provide invaluable solutions
to a wide range of circuit designs.
To take advantage of the full potential of FPGA systems, FPGA development tools
ought to provide the same level of visibility and controllability as a software simulator to
enable designers to quickly perform functional verification of the design. This includes
enabling the designer to view and modify the state of the circuit at any given clock cycle.
In addition, this capability should be provided automatically, without having to generate
new configuration bitstreams every time a designer decides to view different signals.
Unfortunately, current FPGA systems and software fall far short of this standard.
For instance, visibility is limited or difficult to achieve in all of the current commercially
available devices. For some FPGAs the only way to view the circuit state is to route individual internal signals to package pads where they can be accessed by external hardware,
such as a logic analyzer. This approach limits the circuit visibility to only those signals
that are routed to the package pads. It also requires additional time-consuming runs of the
vendor’s place and route software each time the user decides to probe a new signal, which
in turn requires the designer to keep track of the many configuration bitstream files associated with the various analyzed signals. Some approaches require the designer to modify the
original design description to include embedded logic analysis circuitry, which often leads
to unintended modifications resulting in errors. Other FPGAs provide fixed circuitry to
serially read out much of the internal state without the need for added user circuitry; however, the entire internal state of the user circuit is not always available, the state sampling
mechanism actually modifies the circuit state in some cases, and these techniques usually
require the user to stop the external clock, something that experience has shown is often
not supported by commercially-available FPGA systems. In addition to providing only limited visibility, none of the vendor-supplied approaches allow the designer to easily modify
the current values of flip-flops during operation, which is something very commonly done
when debugging a circuit. In general, because of these shortcomings, functional verification of FPGA-based designs is largely an ad hoc process that is much more difficult than
it needs to be, and does not exploit the full design power that could otherwise be taken
advantage of in FPGAs.
Several FPGA vendors have provided additional tools in an attempt to increase the
visibility of the circuit state during debug. Some examples include Xilinx’s ChipScope
and Altera’s SignalTap features. Both of these tools provide real-time access to any node
on the chip, allowing the design to run at full speed during debug. They even allow some
of the trigger signals to change without re-running the design through the place and route
tools. The downside to this approach, however, is that all of the signals to be traced must
be declared up front, before the design is run through the place and route tools. The circuit
visibility is limited to only these signals, and adding any more signals to this list requires
more time-consuming passes through the place and route software. In addition to providing only limited visibility, tools like ChipScope and SignalTap provide no state-modifying
capabilities at all. While such tools can be useful for debugging small portions of a large
chip or when performing timing analysis of the chip as it runs at full speed, their ability to
provide complete functional verification is very limited.
The purpose of this work is to present a systematic approach for providing complete observability and controllability for functional verification of FPGA-based designs.
The basic strategy is to use an approach similar to design-level scan—implemented with
user circuitry—to provide the user with complete visibility of the circuit state, as well as
the ability to load the internal state with known values from an external source during
the verification process. This allows the user complete control to view and modify state
“variables”, just like in a software debugger. Because the approach is systematic, it can
be automated. Using it does not require the modification of the original design specification, thus eliminating a source of error. This approach also allows users to view any part
of the circuit state¹ without additional runs through the vendor’s place and route software.
In addition, since the clock is free running, the circuit can run at full speed until the user
is ready to take a snapshot of the circuit state. Finally, because the approach is based on
user circuitry, it can be made to work with just about any programmable device. The main
downside to this approach, however, is the large area and speed overhead incurred when
instrumenting a user design with scan. Fortunately, due to the reprogrammable nature of
FPGAs, this overhead penalty is only temporary, as the FPGA can be reprogrammed with
the original design without the scan chain once the verification process is complete.
¹ Once the state of synchronous circuit elements is known, the values for any combinational portions of the circuit are easy to infer.
Chapter 2
Current State of Debug
2.1 Mechanisms to Increase Controllability and Observability in FPGA Designs
Various methods exist to enhance observability and controllability in FPGA-based
designs. Like those described in [1] they can be grouped into two categories: ad hoc
methods and structured methods.
2.1.1 Ad Hoc Methods
Many different ad hoc methods can be used to debug user circuitry. One such
method involves multiplexing internal signals onto external debug ports. These signals
can then be viewed using a logic analyzer to help debug the circuit. Other methods include forms of self-test such as signature analysis. This technique uses Cyclic Redundancy
Checking (CRC) to verify the results of various test sequences. Unfortunately, these two
methods are painful to use and provide limited or no visibility into the actual state of the
user circuit.
A common approach is for individual FPGA vendors to provide their own built-in
logic analyzers, such as Xilinx’s ChipScope and Altera’s SignalTap. Both of these allow
real-time access to any node on the chip, as well as the real-time ability to change the
trigger conditions. Although the number of signals that can be accessed at any given time
is greater than for an external logic analyzer, that number is still limited and the signals
must be declared up front, before the design is run through the place and route tools. For
Altera, adding new signals to view or making any other modifications to the embedded
state analyzer other than changing the trigger conditions requires a full recompilation of
the user design, since the modification affects the size and configuration of the hardware.
The same is almost true for Xilinx, except that some of the changes can be made using
their fpga editor software, in which case only a new configuration bitstream needs to be
generated.
The advantages of these ad hoc methods are that they have very little impact on the
area and speed of the circuit and they allow the circuit to run at full speed during debug.
They are very useful for speed testing of limited areas of the circuit. Unfortunately, ad hoc
methods are limited in their ability to provide complete functional verification. First, they
are usually design-specific—they require designer intervention to insert and use. What
is desired is a structured technique that can be applied automatically and without user
intervention to any possible user design. Second, they require the user to specify upfront
the desired signals to observe before the design is run through the vendor’s place and route
tools, and the design must be rerun through these tools every time the user desires to watch
a different signal. This leads to many time-consuming passes through the place and route
tools and many large, unwieldy configuration bitstreams to handle. Third, they provide
limited visibility into the state of the circuit—only the signals that are routed to pads or
to the particular chip’s built-in logic analyzer are visible. Lastly, they do not provide a
mechanism for modifying the state of the circuit, an important debug feature similar to
how the values of variables can be modified in a conventional software debugger.
2.1.2 Structured Methods
A prominent example of a structured technique is readback, a built-in mechanism
which allows a user to retrieve an FPGA’s configuration bitstream. The configuration bitstream in and of itself is not useful for debugging, but when read back from an FPGA, it
contains the current state of the FPGA’s flip-flops and memories. This state data can then
be loaded into a simulator which will provide the user with the complete circuit state.
Another technique is to use partial reconfiguration to load the internal circuit state
from an external source to bring the circuit to a known state during functional verification.
This allows the user to test corner cases or to return the circuit to a known state just before
a point of failure without the tedious process of determining a series of inputs—if one even
exists—that will bring the circuit to the desired state.
A third technique is to use scan chains that are inserted into user circuitry in a
manner similar to the way flip-flop scan chains are employed for VLSI testing [1]. This
version of scan allows the information contained in the flip-flops and embedded memories
to be scanned out serially through a ScanOut pin to obtain the circuit state. It differs from
standard VLSI scan in that its purpose is to obtain the circuit state in order to validate
the circuit logic, whereas the main purpose of VLSI scan is to find defects in the silicon
after the logic has already been verified extensively in software. With FPGAs, however,
silicon validation by the user is unnecessary since it has already been done extensively by
the FPGA vendor. Also, due to the reconfigurable nature of FPGAs, FPGA-based scan is
removable from the design when functional verification is complete, thus eliminating the
overhead that scan incurs.
The benefits of using structured methods are many. First, structured methods have
the potential to provide complete observability and controllability of the user circuit for
functional verification. Thus, not only are all the signals in the circuit visible, the user also
has the ability to set the circuit to a known state to aid in verification. Ad hoc methods,
on the other hand, allow the designer to view only a portion of the circuit as it runs at
full speed, and they do not provide the capability to modify the circuit state. Second,
since structured methods provide the visibility of the entire circuit state, only a single pass
through the vendor’s place and route tools is necessary, eliminating wasted time and extra
configuration bitstreams associated with the many passes required for ad hoc methods.
Third, methods like scan can be instrumented systematically and are not design specific, so
the instrumentation processes can be automated.
Several downsides to using structured methods exist, however. For example, one
disadvantage of readback is that it requires the clock to be stopped in order to perform
the single-step (static) sampling. This is because readback’s state sampling mechanism is
different for flip-flops than it is for embedded memories. In order to maintain consistency
between the two, the clock must be stopped during readback. In addition, scan carries
significant area and speed overheads due to the added user logic necessary to perform scan.
Also, the user circuit does not run at full speed for scan since many clock cycles may be
needed to scan the circuit state out and back in to the design.
In short, while ad hoc methods are useful for speed testing limited areas of the
circuit, providing complete observability and controllability greatly simplifies functional
verification, which significantly reduces the time to debug a circuit. The rest of this chapter
will consider the current state of the three structured techniques mentioned previously—
configuration bitstream readback, design-level scan, and partial reconfiguration—and show
how scan overcomes the limitations of the other two to provide complete observability and
controllability for functional verification of FPGA-based designs.
2.1.3 Configuration Bitstream Readback
Configuration bitstream readback enables a user to view the state of their design,
and is provided by a number of FPGA vendors, such as Xilinx and Lucent. Splash-2 [2]
and DecPerle [3] are examples of some early configurable computing systems that used
readback to debug circuit designs. In Splash-2, for example, information generated by the
Xilinx back-end tools provided a mapping between flip-flop values found in the readback
bitstream and signal names from the XNF file to provide the user with the internal circuit
state. However, the process of synthesizing a VHDL description to XNF, followed by logic
trimming, logic and signal merging, and LUT mapping often made it difficult to find a
signal given only its original name in the HDL source.
As a more recent example of readback using Xilinx, [4] discusses the steps required
to obtain all of the state information for their XC4K devices. One major problem involves
getting the state of CLB RAMs, as it is necessary to determine how CLB RAM address pins
are permuted by the router and then apply those permutations in reverse to obtain the RAM
state. Once the readback data is obtained, [5] describes how it can be used for debugging
in a combined simulation/hardware execution environment with JHDL [6, 7].
To illustrate how readback works, consider the process of using readback to obtain
the circuit state on a Xilinx Virtex FPGA. When readback is performed, the state of all
the flip-flops is sampled and written to specific locations in the configuration bitstream.
The state of the LUT RAMs and BlockRAMs is not sampled, however, since their state is
already maintained in the configuration bitstream. Next, the configuration bitstream exits
the chip via one or more pins, such as a JTAG pin. Finally, the current state of all the
flip-flops, LUT RAMs and BlockRAMs can be obtained from the configuration bitstream
and fed into a simulation environment. Note that the user clock must be disabled during the
readback process to maintain coherency between the flip-flop and RAM states. Otherwise,
the circuit state may be changing while the configuration bitstream is exiting the FPGA,
and although the bitstream will continuously update itself with the new RAM state, the FF
state contained in the bitstream will contain the old, sampled values.
The obvious advantage of readback is that it comes for free from the designer’s
viewpoint when provided as a built-in feature of the FPGA. It does have a number of drawbacks, however. First, readback bitstreams can be extremely large and unwieldy, with only
a small percentage of the bitstream being useful for obtaining circuit state. For example,
in a Xilinx Virtex V1000 FPGA, the bitstream contains over six million bits, only 9% of
which represents the device’s flip-flop and memory state—the rest is configuration data.
Second, specific information telling how the vendor’s software tools mapped the logic to
LUTs is required to locate the state values of interest from the readback bitstream. This
mapping is extremely difficult to obtain for Xilinx XC4000 and Virtex designs. Third, in
some cases not all FPGA state is accessible via readback. For example, the state of the output registers of Virtex BlockRAMs is not available from the readback bitstream. Fourth,
performing a readback may alter the state of the FPGA. The Virtex BlockRAM output
registers are again an example of this, as their state is modified by readback. Fifth, readback requires the ability to stop the external clock, something that is often not supported
by commercially-available FPGA systems. Finally, for many FPGA families, no readback
support is provided at all, so a different mechanism must be devised to observe the FPGA’s
internal state.
2.1.4 Design-Level Scan
Another method of obtaining the state of the circuit is to use design-level scan. Sim-
ilar to Level-Sensitive Scan Design (LSSD) and other scan methods used in VLSI testing
[1], it consists of adding multiplexors and gates to the memory elements of a design—
such as flip-flops and embedded RAMs—so that the state elements’ values can be serially
shifted out of the FPGA. The main downside of this method is that this added user circuitry
may impose a high overhead to implement. The actual area and speed overheads will be
addressed later in Chapter 4.
Nevertheless, compared with readback, design-level scan has several benefits. First,
an FPGA does not require any special capabilities for design-level scan—it can be added
to any user design on any FPGA. Readback, on the other hand, is available on only a
handful of FPGA systems. Second, the amount of data scanned out of the circuit is much
smaller and easier to manipulate than readback bitstreams, since scan bitstreams contain
only the desired circuit state information. Third, determining the positions of signal values
in the scan bitstream is straightforward since it is easy to determine the order in which
the memory elements are arranged in the scan chain. Fourth, the state of the entire circuit
can be retrieved by scan, whereas this is not always the case for readback. The output
registers of the Virtex BlockRAMs are an example of this, as mentioned previously. Fifth,
scan operates on a free running clock, so it does not require the ability to stop the user
clock. That is, the circuit runs at full speed until the user is ready to take a snapshot of
the circuit state, at which point the clock is still running in order to scan out the circuit
state, although no useful work is being done by the circuit during those clock cycles. Sixth,
due to the reprogrammable nature of FPGAs, the scan chain can be completely removed
from the design after verification, thus eliminating the overhead of the scan logic. Seventh,
considerable variations exist in how scan can be implemented. The simplest version is to
place a multiplexor in front of every flip-flop to achieve a serial shift chain, as discussed
later in Chapter 3. It could also be implemented as in Scan/Set Logic [1] by capturing
partial snapshots of a running system’s state without interrupting its operation. Another
partial scan variation is to capture only the input and output registers of selected blocks,
such as multipliers, to reduce the amount of data to be scanned out of the device. When
these blocks have already been previously verified, this level of visibility is often adequate.
Lastly, scan allows the state of the circuit to be modified, a feature that readback does not
provide. This important ability to bring the circuit into a known state is very useful in
functional verification.
2.1.5 Partial Reconfiguration
The above discussion was mostly limited to enhancing a design’s observability.
However, circuit controllability is also an important aspect of debug. For example, some
FPGA systems support partial reconfiguration, which reads the state information from a
selected block of the circuit, modifies the desired state bits, and writes the state back to
the circuit block. Unfortunately, the state of user flip-flops cannot be modified without
modifying the actual user design, which then needs to be reset to its power-up state. In
addition, many FPGA systems do not support any controllability features at all. In contrast,
design-level scan allows the setting of any state element included in the scan chain without
disturbing the state of other circuit elements.
2.1.6 Summary
In summary, existing FPGAs provide at best only partial visibility into the state of
an executing FPGA and little support for configuring a design into a known state. This
chapter suggests using design-level scan to overcome the limitations of ad hoc methods,
readback, and partial reconfiguration techniques to provide complete observability and controllability
for functional verification of a user design on an FPGA.
The remainder of this work is organized as follows. First, a specific implementation
of design-level scan is presented, together with a discussion of some of the CAD tool and
hardware issues involved. Next, this implementation is used to quantify the cost associated
with design-level scan for several designs taken from the configurable computing research
at Brigham Young University. Following this, some variations of and alternatives to scan
are presented to show how they can be used to supplement already existing observability
and controllability features in some FPGAs. Finally, any loose ends about scan that were
not discussed in the previous chapters are tied up, and a conclusion and suggested directions
for future work are presented.
Chapter 3
Implementation of Scan Chain
3.1 Design-Level Scan Implementation
The main idea behind inserting user logic into a scan chain involves wiring up
the memory elements, such as flip-flops and embedded RAMs, in such a way so as to
have the state bits contained in these elements exit the circuit serially through a ScanOut
pin whenever the ScanEnable control signal is asserted. New state data for the FPGA
concurrently enters the circuit serially on the ScanIn pin. When ScanEnable is deasserted,
the circuit returns to normal operation. Figures 3.1 and 3.2 show a high-level view of how
this works.
[Figure 3.1: Circuit View when ScanEnable is Deasserted]

[Figure 3.2: Circuit View when ScanEnable is Asserted]
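To make the intent of Figures 3.1 and 3.2 concrete, the following is a minimal behavioral sketch (in Python, purely for illustration and not part of any tool flow) of a design whose flip-flops either load from the user logic or form one serial shift register, depending on ScanEnable.

    def clock_cycle(ffs, logic_inputs, scan_enable, scan_in):
        """Advance a chain of flip-flop values by one clock edge.

        ffs          -- current flip-flop states (index 0 is nearest ScanIn)
        logic_inputs -- values the user logic would load into each FF this cycle
        scan_enable  -- True selects scan mode, False selects normal operation
        scan_in      -- bit entering the chain this cycle (ignored in normal mode)
        """
        scan_out = ffs[-1]                  # the last element drives the ScanOut pin
        if scan_enable:
            new_ffs = [scan_in] + ffs[:-1]  # serial shift: each FF captures its upstream neighbor
        else:
            new_ffs = list(logic_inputs)    # normal operation: FFs load from the user logic
        return new_ffs, scan_out

    # Scanning a 3-FF design takes 3 cycles; new state shifts in while old state shifts out.
    state = [1, 0, 1]
    captured = []
    for bit in [0, 1, 1]:
        state, out = clock_cycle(state, [0, 0, 0], True, bit)
        captured.append(out)
    print(captured)   # old state leaves in chain order: [1, 0, 1]
    print(state)      # chain now holds the scanned-in state: [1, 1, 0]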
3.1.1 Instrumenting Design Primitives
When implementing scan, only memory elements are inserted into the scan chain.
Each FPGA vendor library has a number of primitive memory elements, such as flip-flops
and embedded RAMs, from which larger memory cells can be derived. When inserting
these larger memory cells into a scan chain, the easiest approach is often to treat the memory as a group of primitive memory elements, which are individually inserted into the scan
chain. This section explains how the various primitive memory elements are instrumented
for scan.
Instrumenting Flip-Flops
FPGA flip-flops (FFs) can be inserted into a scan chain by simply attaching a multiplexor before the data input of the FF and logic gates in front of the enables and set pins,
as shown in Figure 3.3.

[Figure 3.3: Instrumenting a Flip-Flop for Scan]

The ScanIn signal in the figure is the ScanOut from the upstream
memory in the scan chain, and the ScanOut signal becomes the ScanIn for the downstream
memory in the scan chain. Thus, when ScanEnable is asserted, the memories in the circuit
form a shift register; when ScanEnable is deasserted the circuit returns to normal operation.
While ScanEnable is asserted, the FF must be enabled and allow its state bit to be shifted
out. The two extra gates in front of the clock enable and set pins in this example serve this
purpose.
The worst-case area overhead for a scannable FF is to add the multiplexor
and two logic gates; fortunately, this price is rarely paid. For example, in many instances,
clock enables, sets, and resets in a design are tied to a constant voltage, so the two gates
in Figure 3.3 are not required. In other instances, several FFs share the same enables or
set/reset logic, so the two gates in Figure 3.3 can sometimes be shared among multiple FFs.
In addition, in some cases the LUT in front of a FF is empty or has unused inputs, and can
thus be used for either the multiplexor or one of the gates. Figure 3.4 shows an example of
how a bank of three FFs would be instrumented for scan if the clock enables are all shared
and the sets tied to ground.
[Figure 3.4: Instrumenting Multiple Flip-Flops for Scan]
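The gating in Figure 3.3 can be summarized as a one-cycle update rule; the sketch below is an illustration only, not vendor circuitry, and it assumes, as the text does, that asserting ScanEnable must force the FF's enable active and keep its set from firing.

    def scannable_ff(q, d, scan_in, clk_en, set_, scan_enable):
        """Next state of one scan-instrumented flip-flop."""
        data   = scan_in if scan_enable else d   # 2:1 multiplexor on the D input
        enable = clk_en or scan_enable           # OR gate on the clock enable
        do_set = set_ and not scan_enable        # gate the set so it cannot fire during scan
        if do_set:
            return 1
        return data if enable else q             # hold the old value when not enabled

    # Scan mode always captures the upstream bit, regardless of the user's enable/set:
    assert scannable_ff(q=0, d=1, scan_in=1, clk_en=0, set_=1, scan_enable=True) == 1
    # Normal mode preserves the original behavior (here: not enabled, so the FF holds):
    assert scannable_ff(q=0, d=1, scan_in=0, clk_en=0, set_=0, scan_enable=False) == 0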
Instrumenting Embedded RAMs
Inserting embedded RAMs into scan chains is significantly more complicated than
inserting FFs. A RAM has multiple bits to scan out, so it is wired up to operate
like a FIFO when ScanEnable is asserted. It outputs its contents one bit per cycle while upstream ScanIn values are concurrently scanned in at one bit per cycle. For some embedded
RAMs this is relatively simple to do, for others it can be very difficult.
To illustrate, Figure 3.5 shows three 32X1 RAMs that are connected for scan.

[Figure 3.5: Embedded RAMs Linked in a Scan Chain]

The
ScanIn for each RAM is simply the ScanOut of the upstream memory element. On the first
cycle that ScanEnable is asserted, the Address Generator, which is basically an up-counter,
produces a value of 0. The data bit stored at address 0 of each RAM in the circuit is read
out the Dout port and passed along as the ScanIn to the downstream memory element. On
the next cycle, the ScanIn value at each RAM is written into its address 0, the Address
Generator produces a value of 1, the data stored at address 1 of each RAM is read out the
Dout port and passed down the scan chain, and so on. After 32 cycles of this, the RAMs
see an address of 0 again, and the process repeats, resulting in a FIFO.
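Behaviorally, each RAM in Figure 3.5 acts as a one-bit-wide FIFO while ScanEnable is asserted. The following sketch (illustrative Python, with the write-after-read folded into a single step) captures that behavior; it is not the hardware itself.

    def scan_cycle(ram, addr, scan_in):
        """One scan cycle for a 1-bit-wide LUT RAM; returns (scan_out, next_addr)."""
        scan_out = ram[addr]            # asynchronous read, passed down the chain
        ram[addr] = scan_in             # write-after-read of the upstream bit
        return scan_out, (addr + 1) % len(ram)

    ram = [1, 0, 1, 1]                  # a toy 4x1 RAM
    addr, out_bits = 0, []
    for bit in [0, 0, 1, 0]:            # new contents arriving from upstream
        out, addr = scan_cycle(ram, addr, bit)
        out_bits.append(out)
    print(out_bits)                     # old contents leave in address order: [1, 0, 1, 1]
    print(ram)                          # new contents are now in place: [0, 0, 1, 0]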
Several details need to be addressed in order for this to work. For instance, any
RAM to be inserted into the scan chain in this manner must be able to perform a read and a
write in the same cycle. In the case of single-ported RAMs, it must also exhibit write-after-read behavior to avoid destroying unread data. For multi-ported RAMs, the write-after-read
behavior is not required if the RAM supports different addresses for reading and writing. If
the RAM does not meet these criteria, it must first be replaced by a comparable RAM that
does at the time of insertion into the scan chain.
An example of a single-ported RAM is the synchronous LUT RAM in the Xilinx
XC4000 and Virtex families. It is an asynchronously-read, synchronously-written (ARSW)
memory and is straightforward to instrument for scan. During each cycle of scan, a ScanOut
value is asynchronously read at the RAM’s output port while a new ScanIn value is written
to that same location on the next clock edge, as explained previously. In addition, since the
read is asynchronous, data is available for shifting out on the ScanOut port during the same
cycle that ScanEnable is first asserted.
[Figure 3.6: Multi-Bit ARSW RAM Instrumented for Scan]
Figure 3.6 shows how this works for a single 32X3 RAM cell. The scan logic must
be designed so as to allow normal circuit operation when not operating in scan mode. The
multiplexors and OR gate in Figure 3.6 serve that purpose. Since the output of the RAM in
the figure is three bits wide, the output bits are wrapped back around to the inputs during
scan to form one continuous FIFO. Thus, each RAM output bit is wrapped back to the next
data input in the chain, and the final output bit is the ScanOut value. This can be extended to any width
memory desired. Also, the address generator counter is designed to start at a count of zero
during the first cycle of scanning out so that the RAM bits are retrieved in a predictable
order. Figure 3.7 shows that the address generator can be formed using a counter, a reset
signal and a multiplexor.
[Figure 3.7: Address Generator for RAM Instrumentation]

After the FF and RAM contents have been scanned out of the circuit, care must
be taken as to what address appears on the address generator when the contents are being
scanned back in. To illustrate, consider the example of a memory that is 16 bits deep in a circuit
with a total scan chain length of 18. An example of such a circuit is a design containing
one 16X1 RAM and two FFs. In order to ensure the RAM contents are replaced at their
correct addresses when they are scanned back in, address 15 of the RAM needs to be the
location written to during the last cycle of scan-in. Since the scan chain length is 18 in
this example, this will only be accomplished if the address generator causes the RAM to
be written at address 14 on the first cycle the scan bitstream is being scanned back into
the circuit. A simple control unit is used to make sure the first bit of data to be written
for scanning-in appears at the correct address; in this case, when the address generator is
showing an address of 14. The control unit essentially consists of a counter whose number
of count cycles is a function of the largest memory size used in the circuit and the total scan
chain length. It controls which cycle data starts being scanned back into the circuit so that
all the RAM state bits are placed at the correct memory locations.
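Following the example above as reconstructed, the address the generator must show on the first scan-in cycle falls out of simple modular arithmetic; the helper below is only a sketch of that calculation.

    def first_scan_in_address(depth, chain_length):
        # The last of chain_length writes must land on address depth-1,
        # so the first write lands on (depth - chain_length) mod depth.
        return (depth - chain_length) % depth

    print(first_scan_in_address(16, 18))   # 16X1 RAM plus two FFs -> address 14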
Consider the case where one or more bits of the RAM address are tied to a constant
voltage. In such cases, some sections of the RAM are unused by the circuit and do not need
to be included in the scan chain. Hence, the address generator outputs are only connected
to those address bits that are not tied to a constant voltage, which reduces the number of
scan cycles for the circuit. As an additional benefit, if no other RAM shares this address
generator, the size of the address generator can also be reduced.
The overhead for instrumenting scan with ARSW RAMs can be determined by
examining Figures 3.6 and 3.7. The overhead required to instrument an m-bit deep by
n-bit wide RAM is log2(m) + 1 LUTs for the address generator, log2(m) LUTs to
multiplex the address generator with the normal RAM address signal, 1 LUT to handle the
write-enable logic, and n LUTs for the wrap-around data multiplexors, for a total overhead
of 2 log2(m) + n + 2 LUTs. However, if there are multiple RAMs in the circuit, the address
generator logic can be shared amongst all the RAMs, meaning that the LUT overhead for
each additional RAM in the circuit is only log2(m) + n + 1 LUTs.
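The expressions above (as reconstructed here) can be collected into a small cost estimate; treat the exact constants as the text's approximation rather than a measured result.

    from math import ceil, log2

    def arsw_scan_overhead(m, n, share_address_generator=False):
        """Approximate extra LUTs to instrument an m-deep by n-wide ARSW RAM for scan."""
        addr_bits = ceil(log2(m))
        addr_gen  = addr_bits + 1           # counter plus its reset/mux logic (sharable)
        per_ram   = addr_bits + 1 + n       # address muxes + write-enable LUT + data wrap muxes
        return per_ram if share_address_generator else addr_gen + per_ram

    print(arsw_scan_overhead(32, 3))        # first RAM in the design
    print(arsw_scan_overhead(32, 3, True))  # each additional RAM sharing the address generator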
Finally, in the Xilinx Virtex technology, there exists a special kind of LUT-based
RAM called the Shift Register LUT (SRL) that also requires consideration. This memory
element is inserted into the scan chain as shown in Figure 3.8. When ScanEnable is asserted, the SRL is configured to its maximum size and the entire contents are shifted out a
bit at a time. From the figure it can be seen that the LUT overhead required to instrument an
m-bit deep SRL is log2(m) + 2 LUTs. In a similar manner to the other LUT-based RAMs
explained previously, if one or more of the address pins on the SRL are tied to a constant
voltage, the OR-gate to that address pin is eliminated, thus reducing the overhead of wiring
up the SRL for scan.
[Figure 3.8: 16-deep SRL Instrumented for Scan]
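The SRL case reduces to an even simpler count under the same reconstruction: one data multiplexor, one clock-enable gate, and one OR gate per address pin that is not tied to a constant.

    from math import ceil, log2

    def srl_scan_overhead(depth, constant_addr_pins=0):
        """Approximate extra LUTs to instrument a depth-bit SRL for scan."""
        return ceil(log2(depth)) + 2 - constant_addr_pins

    print(srl_scan_overhead(16))   # a 16-deep SRL costs about 6 LUTs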
Instrumenting Fully Synchronous Embedded RAMs
Up until this point, the RAMs being considered all had asynchronous reads with
synchronous writes. A challenging consideration now is RAMs with synchronous writes
and reads. In the case of many such RAM blocks (Virtex Block SelectRAMs being a typical
example), the behavior on a write is to forward the data being written to the RAM port’s
output register. This violates the assertion made earlier in this chapter that writes must take
place after reads to avoid destroying unread data. However, this behavior is not required
for multi-ported RAMs that support different addresses for reading and writing. Thus, the
first step for instrumenting such RAMs for scan is to replace the single-ported RAMs with
their dual-ported counterparts. The read address for multi-ported RAMs will be one ahead
of the write address during scan to create the necessary write-after-read behavior. If a
fully synchronous single-ported RAM becomes available that does not have an appropriate
substitute to provide the required write-after-read behavior, further research will need to be
conducted to determine how such a RAM can be inserted into a scan chain.
Instrumenting a fully synchronous RAM with scan is a tricky process. To illustrate,
Figure 3.9 shows an example circuit consisting of a 4096-bit synchronous RAM block and
two user registers (U1 and U2). During normal operation these three elements are tied to
some logic and are not necessarily related. A first attempt at scanning data out of these
elements might be to wire them into a serial chain as shown in Figure 3.10.

[Figure 3.9: Sample Circuit Containing Synchronous RAM]

[Figure 3.10: First Attempt At Scan-Out for Sample Circuit]

[Figure 3.11: Corrected Scan-Out Operation]

[Figure 3.12: Scan-In for Sample Circuit]

However, a number of issues require modifications to this approach, including the following:
• Due to the synchronous nature of the reads, this setup would overwrite an unread
memory location in the RAM during the first cycle of scan-out. The solution, as
shown in Figure 3.11, is to provide an additional flip-flop (S1) before the RAM in
the scan chain, as well as to inhibit writing to the RAM during the first cycle of scan.
Thus, during the first cycle of scan the contents of U1 will be written to S1, and
nothing will be written to the RAM to allow a read to take place. On succeeding
cycles, the values stored in S1 will be written to the RAM.

• The first bit to exit a fully synchronous RAM when ScanEnable is first asserted is the
current contents of the RAM’s output register, shown in Figure 3.11 as the second bit
from the right (labeled R) in the scan-out bitstream. The actual contents of the RAM
do not start to appear until the second cycle of scan. Since the output registers on
Virtex BlockRAMs cannot be reloaded, the R bit is considered an extra bit in the scan
chain. One option is to remove all the R bits from a scan bitstream before loading it
back into an FPGA. However, an easier alternative that allows the unmodified scan
chain to be fed directly back into the circuit requires only a minor modification to
the circuit, as shown in Figure 3.12. A new flip-flop (S2) has been inserted after the
RAM during scan-in operation to store the extra R bit. It is important to note that
the extra register before the RAM is gone during scan-in since data can begin writing
into the RAM on the first cycle of scan-in.
Figure 3.13 shows an instrumented synchronous RAM which addresses all of these
issues. In the figure, ScanningIn is a control indicating that a scan-in operation is taking
place.
Since the BlockRAM output registers cannot be reloaded, measures must be taken
to ensure that the BlockRAM output reflects the correct value on the first cycle after a scan
is performed. This can be accomplished by adding the logic shown in Figure 3.14, where a
shadow register and a multiplexor on the output of the RAM are used to capture the output
register’s contents on the first cycle of scan-out. Note that the value of Dout from the
multiplexor does not matter on this cycle. On the first cycle after scan when ScanEnable
goes low, the Dout from the multiplexor gets the value contained in the shadow register.
[Figure 3.13: Synchronous RAM Instrumented for Scan]

[Figure 3.14: Shadow Output Register for Synchronous RAM]
The above discussion has assumed the synchronous memory primitives had one-bit
wide outputs. As was shown in Figure 3.6, the approach for handling a multi-bit ARSW
RAM is to loop each bit back to another input in a daisy-chain fashion. In the case of
synchronously-read RAMs however, serial-to-parallel converters are placed in front of the
RAM, and parallel-to-serial converters are placed after it. For an n-bit wide RAM, a read
and write are performed once every n cycles. The converters then cause the RAM to receive
and produce one bit per cycle in the scan chain. The main reason for using the converters
instead of following the same approach as the ARSW RAMs is because some FPGAs have
synchronously-read RAMs that allow different port widths. This technique covers both the
case where all the ports are the same and the case where they are different, with the latter
case implying that the reads and writes will occur at different rates.
3.1.2 Storing the Scan Bitstream
One more issue involves where to store the state bits between the time they exit the
circuit on the ScanOut pin and the time they reenter the circuit on the ScanIn pin. There
are many possible solutions to this issue, depending on the particular FPGA-based system
being used. One easily implemented solution is to use the system’s external memory to
store the scan bitstream. This can be done by storing the scan bitstream into sections of
the memory that are unused by the user circuit. If not enough unused memory is available,
sections of memory can be swapped out by the host controller long enough to allow the
scan bitstream to be stored. After the bitstream is scanned back into the circuit again, the
host controller replaces the memory sections it swapped out. A second method is to simply
have the host control the bitstream storage by temporarily storing it in its own memory
space.
3.1.3 Instrumenting The Design Hierarchy
A number of methods exist for actually applying the scan instrumentation described
in this chapter. One such method involves making modifications to a design that has already
been placed and routed. This technique is difficult to implement since adding the scan logic
will greatly interfere with the existing routing, and the circuit will have to be placed and
routed all over again. Another option is making modifications to an EDIF netlist. Parsing
and modifying an EDIF netlist is also difficult, but it can be a viable solution. A third
option is to make the modifications to a circuit database prior to netlisting in the original
CAD tool. This option is the approach of choice within the JHDL design environment since
it is relatively simple to implement and can easily be automated.
In this approach, the user design is first included inside a design “wrapper” that
adds the four wires for controlling the scan chain—ScanEnable, ScanIn, ScanOut, and
ScanningIn—and connects these and the user’s wires to I/O pins on the FPGA. Next, the
instrumentation tool traverses the circuit hierarchy in a depth-first fashion, visiting all design submodules and inserting all primitive memory elements into the scan chain. This
is done by adding the four scan signals as ports to each hierarchical cell, and adding scan
logic to each flip-flop and embedded RAM, as described in the previous section. Finally, an
address generator is added as needed for controlling the memories. Once the design is instrumented, an EDIF netlist is then generated and run through the FPGA vendor’s back-end
tools.
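The traversal itself is simple to sketch. The fragment below is not the JHDL implementation; it walks a toy, dictionary-based netlist depth-first and threads the running scan wire through the chain, which is the essence of the pass described above. All names and the netlist structure are hypothetical.

    def instrument(cell, upstream):
        """Depth-first walk; returns the wire that ends the scan chain so far."""
        for child in cell["children"]:
            if child["type"] in ("flip_flop", "embedded_ram"):
                child["scan_in"] = upstream              # wire this element into the chain
                upstream = child["name"] + ".ScanOut"
            else:                                        # hierarchical cell: recurse
                upstream = instrument(child, upstream)
        return upstream

    design = {"children": [
        {"type": "flip_flop", "name": "U1"},
        {"type": "module", "name": "sub", "children": [
            {"type": "embedded_ram", "name": "RAM0"}]},
        {"type": "flip_flop", "name": "U2"},
    ]}
    print(instrument(design, "ScanIn"))   # -> U2.ScanOut, the wire that drives the ScanOut pin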
3.1.4 Optimizing Scan
As is shown later in Chapter 4, the insertion of scan chains into FPGA-based circuits
can be costly. A number of strategies will help reduce this overhead. For instance,
in many designs, a number of gates and flip-flops will be optimized away by the FPGA
technology mapping tools because they have sourceless inputs or loadless outputs. An example of this is when a pipelined array multiplier is created by a module generator where
not all multiplier outputs are utilized by the user circuit. Instrumenting the unused circuitry
for scan creates more overhead than is necessary for two reasons: not only is circuitry being added to unused flip-flops, but the added circuitry itself also prevents the flip-flops from being
optimized away by the back-end tools. Thus, only state elements that are not normally optimized away should be inserted into the scan chain. In the current implementation of scan
for Xilinx FPGAs, this is done by running the back-end tools on an uninstrumented version
of the design, parsing an XDL representation of the mapped design to determine which
flip-flops and embedded memories were optimized away, and storing this information in
a file. When the same design is instrumented with scan, the file contents are placed in a
hash table, and the table is consulted for each flip-flop and embedded RAM to determine
whether it should be inserted into the scan chain.
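A sketch of that bookkeeping, with a hypothetical one-name-per-line file format standing in for whatever the real tool writes after parsing the XDL:

    def load_optimized_away(path):
        """Read the names of elements the back-end tools removed into a set."""
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    def should_instrument(element_name, optimized_away):
        # Only elements that survive mapping are worth adding to the scan chain.
        return element_name not in optimized_away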
Another kind of optimization that can be performed is resource sharing of clock
enable and set/reset logic. Many modules, especially large, regular ones such as multipliers,
CORDIC operators, counters, etc., have all their flip-flops share common clock enable
and set/reset logic. In addition, RAM modules formed from primitive LUT RAMs share
common addressing logic amongst the LUT RAMs. Two options exist for optimization in
these situations. The first is to determine, within a particular layer of hierarchy, the sources
of all the clock enables, set/reset logic, and addressing logic. The common sources then
share the same scan instrumentation logic. The second method is to simply maintain a list
of frequently used modules that share common enables and sets/resets for FFs, and address
signals for RAMs. Whenever one of these is encountered during the depth-first circuit
traversal, shared signals are instrumented only once for scan for the whole module. The
first method has greater potential to reduce the overhead of scan than the second method;
for example, in the second method, if a module is used that could share a lot of logic,
but is not on the list of commonly used modules, the optimization will not be performed.
However, the first method is much more difficult to implement correctly than the second
method. For the tests reported in the next section, the second approach was taken.
Lastly, as mentioned previously, many mechanisms exist for doing partial scan.
In the tests reported next, partial scan results are estimated for a number of modules and
designs.
Chapter 4
Costs of Scan Chain
4.1 The Costs of Design-Level Scan
This chapter discusses the costs of instrumenting user circuits with scan chains.
Some examples include the extra I/O pins used for the ScanIn, ScanOut, ScanEnable, and
ScanningIn control signals mentioned throughout Chapter 3, as well as the off-chip memory
required to store the scan bitstream when operating in scan mode discussed in Section 3.1.2.
The main concern to a designer, however, is the circuit area and speed overhead of scan.
Reported area overheads for full scan in VLSI range from 5% to 30% [8, 9, 10]. This chapter will
show that the area and speed overheads of full scan in FPGAs are much greater than this.
4.1.1 Scan for Library Modules
To begin with, consider three modules taken from the JHDL Virtex libraries. These
modules consist of a counter, an array multiplier, and a CORDIC unit. The counter is a
simple 4-bit up-counter, as shown in Figure 4.1. This circuit consists of a registered 4-bit
adder that adds a constant 1 to the count output each cycle. Table 4.1 shows that the counter
normally requires four 4-input LUTs (4-LUTs) and four flip-flops (FFs), as is expected.
Instrumenting the counter for scan involves adding an extra multiplexor for the data input
and an extra OR gate to each FF in the design as explained in Section 3.1.1 (no sets or resets
were used in the counter). Thus, Table 4.1 shows that a total of twelve LUTs is required
to instrument the counter for scan, which makes the circuit three times the size of the original.
However, a more optimal approach may be taken since all the FFs share a common clock
enable. The optimized approach, then, is to OR the clock enable with the ScanEnable and
Figure 4.1: A 4-Bit Up-Counter
have this single OR gate drive the clock enables for all four FFs, as shown in Figure 4.2. This reduces the total LUT count to nine, which makes the optimized circuit 2.25 times the size of the original, as shown in Table 4.2.
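The following behavioral sketch (in Python, purely for illustration) mimics the optimized scanned counter: a single shared clock-enable OR gate, a scan multiplexor on each flip-flop's data input, and Out3 doubling as ScanOut.

    # Behavioral sketch of the optimized scanned counter of Figure 4.2.
    class ScanCounter4:
        def __init__(self):
            self.q = [0, 0, 0, 0]                 # Out0..Out3 (Out0 is the LSB)

        def clock(self, clk_en, scan_enable, scan_in):
            scan_out = self.q[3]                  # Out3 doubles as ScanOut
            if not (clk_en or scan_enable):       # shared OR gate on the clock enable
                return scan_out
            if scan_enable:                       # shift: ScanIn -> Out0 -> ... -> Out3
                self.q = [scan_in] + self.q[:3]
            else:                                 # normal mode: registered increment
                count = (sum(bit << i for i, bit in enumerate(self.q)) + 1) & 0xF
                self.q = [(count >> i) & 1 for i in range(4)]
            return scan_out

    cnt = ScanCounter4()
    for _ in range(5):
        cnt.clock(clk_en=1, scan_enable=0, scan_in=0)        # count up to 5
    state = [cnt.clock(clk_en=0, scan_enable=1, scan_in=0) for _ in range(4)]
    print(state)                                  # [0, 1, 0, 1]: Out3..Out0 of the value 5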
The next module in the tables is a 16-bit-by-16-bit, fully-pipelined array multiplier for which only the upper 16 bits of the product are used. Figure 4.3 shows conceptually how 16 multiplier cells in each pipeline stage are arranged into 16 pipelined stages, with one bit of input X being multiplied with the entire input Y each clock cycle. The skew registers shown in the figure allow each bit of X to enter the multiplier cells during the correct cycle. Figure 4.4 shows a single multiplier cell, which requires two FFs to pipeline the Y input and the partial product of the multiplier. Since only the upper 16 bits of the result are used, no deskewing FFs are required for the multiplier's output. This design has many more FFs than 4-LUTs due to the skew registers for X and the two pipeline registers per cell. As shown in Table 4.1, the number of LUTs required for the normal multiplier is approximately 270, and the number of FFs used in the multiplier is the sum of the FFs for the skew register and the two pipeline registers, for a total of 615 FFs.
Figure 4.2: A 4-Bit Up-Counter Instrumented for Scan
Table 4.1: Design-Level Scan Costs for a Few Modules without Optimizations

  Module     Num of   Normal                 Scan w/o optimizations
             FFs      LUT     Speed          LUT     LUT     Speed    Speed
                      Count   (MHz)          Count   Ratio   (MHz)    Ratio
  cnt           4       4     165.67           12    3       137.01   0.83
  mult        615     270     103.63         1470    5.44     85.87   0.83
  cordic      768     780      75.79         2301    2.95     63.42   0.76
  averages                                           3.80              0.81
Table 4.2: Design-Level Scan Costs for a Few Modules with Optimizations

  Module     Num of   Normal                 Scan w/ optimizations
             FFs      LUT     Speed          LUT     LUT     Speed    Speed
                      Count   (MHz)          Count   Ratio   (MHz)    Ratio
  cnt           4       4     165.67            9    2.25    131.56   0.79
  mult        615     270     103.63          871    3.23     85.32   0.82
  cordic      768     780      75.79         1596    2.05     60.75   0.80
  averages                                           2.51              0.80
Figure 4.3: Conceptual View of a 16x16 Array Multiplier
Figure 4.4: A Single Multiplier Cell (PPin = Partial Product In, PPout = Partial Product Out)
The unoptimized scan insertion approach adds a multiplexor and an OR gate to every FF. This requires 1,200 additional 4-LUTs, for a total of 1,470 (5.44 times the number in the original circuit). The optimized version shares one OR gate for the clock enable for all FFs, as illustrated in Figure 4.5. (Although the skew registers for X are not shown in the figure, they are instrumented for scan in a manner similar to the example in Figure 3.4.) In this case, the design has only 3.23 times as many LUTs as the original.
Figure 4.5: A Single Multiplier Cell Instrumented for Scan
The last module in Tables 4.1 and 4.2 is a 16-bit, fully-pipelined rotational CORDIC
unit containing 15 iterations, with each iteration providing one more bit of accuracy to the
output. Since the CORDIC is unrolled to make it fully pipelined, each iteration corresponds
to a stage, as shown in Figure 4.6. Each stage of this design consists of three registered
adders/subtractors, as shown in Figure 4.7. Note that the shifters in the figure simply reorder the input signals, and do not contain any logic. This design is quite different from
the multiplier, since the number of LUTs and FFs is roughly the same. Figure 4.8 shows
the optimized version of scan for the CORDIC unit, with a single OR gate controlling the
Figure 4.6: A Fully-Pipelined Rotational CORDIC Unit
Figure 4.7: One CORDIC Stage
clock enables on all the FFs. The three “FF Scan Chain” blocks shown in the figure simply
consist of a bank of sixteen FFs hooked up to a scan chain, in the same manner as the
example shown in Figure 3.4 (but without the OR gate). As can be seen in Table 4.2, the
result is a design with a little more than twice the number of LUTs as the original.
Figure 4.8: A Fully-Pipelined Rotational CORDIC Unit Instrumented for Scan
Two important notes about the increase in LUTs for scan logic are in order. First,
some extra LUTs in addition to those used for the multiplexor and logic gates may be
required when instrumenting scan for place and route purposes. For example, if many FFs
to be scanned are packed close together in the original circuit, the vendor’s place and route
tools may require some extra LUTs to help route the scan multiplexors to the appropriate
FFs. Second, sometimes the increase in LUTs for scan is lower than expected since some
of the scan instrumentation logic is folded into existing 4-LUTs from the user circuit. This
takes place when 4-LUTs from the original design have unused inputs that can be used
by some of the instrumented scan logic. To illustrate, Figure 4.9 shows that if the logic
generating the clock enable signal uses two inputs of a 4-LUT, the ScanEnable signal can be routed to a third input of the LUT so that the OR gate does not require an additional
LUT. Section 4.1.4 will show what kind of impact logic packing has on scan for several
designs.
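A small illustration of this folding: both the original two-input clock-enable function and the same function ORed with ScanEnable fit in a single 16-entry 4-LUT truth table. The particular clock-enable function used here is made up for the example.

    # Illustration of Figure 4.9: a 4-LUT with a spare input can absorb the
    # scan OR gate without consuming an additional LUT.
    from itertools import product

    def lut_contents(func, n_inputs=4):
        # Build the 2^4-entry truth table that would be loaded into the 4-LUT.
        return [func(*bits) for bits in product((0, 1), repeat=n_inputs)]

    original  = lut_contents(lambda gce, rst, a, b: int(gce and not rst))
    with_scan = lut_contents(lambda gce, rst, scan_en, b: int((gce and not rst) or scan_en))

    print(len(original), len(with_scan))   # both are 16 entries: still one 4-LUT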
Figure 4.9: Folding Scan Logic Into Existing 4-LUTs
Notice from Tables 4.1 and 4.2 the speed penalties suffered by the three library modules when instrumented for scan. On average, these circuits can operate at only about 80% of the original frequency when the scan logic is inserted. The effect of this penalty depends on the application; some circuits have a minimum operating speed that must be met, whereas for other circuits the operating speed is not as important. Notice as well that, as scan is optimized for area, the speed at which the circuit may operate is reduced further. Thus, sometimes trading off increased area for increased speed is necessary, depending on the application. Since scan is being used for functional verification of the circuit logic and will be removed once the design is verified, this work will not place much emphasis on the speed penalty incurred by scan.
The area overheads in Tables 4.1 and 4.2 were determined by counting the number
of additional 4-LUTs used for scan instrumentation. However, this number often does not
accurately reflect the increase in area of the circuit. To illustrate, a Virtex slice contains
two FFs with two corresponding 4-LUTs. If the FF in a particular slice is being used by
a particular design but its corresponding 4-LUT is left unused, that unused 4-LUT can be
used for the scan multiplexor without increasing the number of slices in the design. In
other words, an increase in the number of 4-LUTs does not necessarily mean an increase in
circuit area. Thus, a better area metric to use is the logic element (LE), which consists of a single 4-input LUT, carry logic, and a FF. LEs are the basic building blocks of most modern SRAM-based FPGAs, and using them as a metric more accurately shows the area overhead of the design, as it takes into consideration the scan logic that fits into partially filled LEs without increasing the area of the circuit.
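A crude way to see why LEs are the better metric is to note that, under ideal packing, a design needs at least as many LEs as the larger of its LUT and FF counts. The sketch below applies this lower-bound estimate to the multiplier from Table 4.3; real placement adds some slack, so it is only an approximation.

    # Rough logic-element (LE) estimator, assuming ideal packing: each LE holds
    # one 4-LUT and one flip-flop, so the LE count is at least the larger of the
    # LUT count and the FF count.
    def estimate_les(lut_count, ff_count):
        return max(lut_count, ff_count)

    # The pipelined multiplier from Table 4.3: 270 LUTs and 615 FFs.
    print(estimate_les(270, 615))   # -> 615, close to the 630 LEs reported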
Table 4.3: Design-Level Scan Costs for a Few Modules—LUT vs. LE Costs

  Module     Normal            Scan w/o opt                    Scan w/ opt
             LUT     LE        LUT     LUT    LE      LE       LUT     LUT    LE      LE
             Count   Count     Count   Ratio  Count   Ratio    Count   Ratio  Count   Ratio
  cnt           4       4        12    3        12    3           9    2.25      9    2.25
  mult        270     630      1470    5.44   1485    2.36      871    3.23    871    1.38
  cordic      780     812      2301    2.95   2316    2.85     1596    2.05   1596    1.97
  averages                             3.80           2.74             2.51           1.87
Table 4.3 provides an LE-based comparison of circuit area overheads for the counter, multiplier and CORDIC. As mentioned previously, the counter and the CORDIC use
roughly the same number of FFs as 4-LUTs in their designs; as such, most of their LEs
are already full before the design is instrumented for scan. So in their case, increasing the
number of 4-LUTs also increases the number of LEs by roughly the same amount, since
there aren’t any partially filled LEs to place the scan logic in. Hence, the table shows that
the overhead in these two cases is about the same whether it is measured in 4-LUTs or
LEs. In contrast, the multiplier has many more FFs than LUTs due to its many pipeline registers, so much of the scan logic can be packed into the partially filled LEs left by these pipeline FFs. Thus the table shows that the multiplier is really only 2.36 times as large when instrumented with scan when the area is measured in terms of LEs, as opposed to 5.44 times as large in terms of LUTs. Coincidentally, for these three modules the LE count and LUT count are the same for optimized scan since the LE count is LUT dominated. This will not necessarily be the case for all designs. However, this table shows that by
measuring the overhead in terms of LEs instead of 4-LUTs, a more accurate view of the
scan area overhead is seen, which is not quite as large as originally believed.
In short, designs that have a high FF-to-LUT ratio allow packing of scan logic into
partially filled LEs, thus reducing the area overhead incurred by scan. This is particularly
evidenced in the multiplier, where the pipeline registers leave many LEs only half-full.
However, having a high FF count can significantly increase the area of the circuit when
instrumenting scan, particularly when most of the LEs are full. In the case of these modules,
doing full scan on average nearly doubles the number of LEs used by the circuit. For
designs that are dominated mostly by combinational logic, however, fewer memory cells
need to be scanned, so the area growth is much smaller.
4.1.2
Partial Scan
In recent years, partial scan techniques have been extensively researched and developed as a method to reduce the area overheads associated with scan chains. A common approach is to scan only portions of the circuit deemed important for debug. For example, the approach taken here is to scan only the input and output registers of certain library
modules. When the module has already been well tested and verified, only the state of the
input and output registers of the module is necessary for verifying the rest of the circuit;
the internal state of the module is not necessary since it is already well-known.
Consider, for example, the array multiplier shown previously in Figure 4.3. If this
module has already been extensively verified, the partial scan approach is to scan the final
output registers and to use separate “shadow registers” to capture the inputs. The pipeline
registers are not included in the scan chain since the behavior of the multiplier is already
well known. However, the FFs not included in the scan chain must be disabled during scan
so that their state isn’t modified. One approach is to add an extra AND gate to such FFs, as
shown in Figure 4.10, so that the state of these FFs does not change during scan. However,
most of the FFs in the multiplier will require this AND gate since they will not be included
in the scan chain. A better approach in this case, then, is to reduce the overhead by using
this AND gate to disable the entire multiplier during scan. A single OR gate can then be
used for the FFs in the final output register to enable them during scan. A conceptual view
Figure 4.10: Disabling Unscanned FFs for Partial Scan
of this is shown in Figure 4.11. The three “FF Scan Chain” blocks each consist of a bank
of sixteen FFs hooked up to a scan chain, in essentially the same manner as the example
shown in Figure 3.4, but without the OR gate for the clock enable. The two “FF Scan
Chain” blocks shown by dotted lines in Figure 4.11 represent the two shadow registers, and
the “FF Scan Chain” block on the right represents the final output registers of the multiplier.
Figure 4.11: Conceptual View of Partial Scan for the Array Multiplier
Table 4.4: Design-Level Scan Costs for a Few Modules—Full vs. Partial Scan

  Module     Normal    Scan w/o opt       Scan w/ opt        Partial Scan (est.)
             LE        LE      LE         LE      LE         LE      LE
             Count     Count   Ratio      Count   Ratio      Count   Ratio
  cnt           4        12    3             9    2.25        N/A    N/A
  mult        630      1485    2.36        871    1.38        680    1.08
  cordic      812      2316    2.85       1596    1.97        894    1.14
  averages                     2.74                1.87               1.11
Table 4.4 compares full scan costs with estimated partial scan costs. For the multiplier, the cost is about 50 LEs—1 LE to disable the multiplier during scan, 32 LEs to scan the two 16-bit input shadow registers, 16 LEs for the scan multiplexors for the 16-bit output register, and 1 LE for the OR gate for the output register—resulting in approximately 8% overhead. The same approach is taken for the CORDIC unit, in that shadow registers are used to capture the inputs, the final output registers are connected to the scan chain, a single AND gate is used to disable the CORDIC during scan, and a single OR gate is used to enable the output FFs during scan. The overhead is approximately 82 LEs—48 LEs to scan the three 16-bit input shadow registers, 2 LEs for the single AND and OR gates, and 32 LEs for the scan multiplexors for two of the output registers—the Zout for rotational CORDICs and the Yout for vectoring CORDICs need not be scanned since it has a value of 0. This amounts to about 14% overhead, which is significantly less than the 97% overhead reported for optimized full scan.
As shown by these two examples, when partial scan is used for some large modules
such as pipelined multipliers, CORDIC units, and other large datapath elements, the overhead is greatly reduced. However, for some library modules, such as counters, all of the
FFs need to be scanned to get the state of the counter, so partial scan does not provide any
advantages in overhead.
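The partial-scan estimates of Table 4.4 can be checked with a small calculation that follows the breakdown given above (the breakdown itself is an estimate, so the result is only as good as those assumptions).

    # Worked check of the partial-scan estimates (counts from Table 4.4).
    def partial_scan_les(base_les, shadow_bits, scanned_output_bits, control_les):
        # Shadow registers cost one LE per bit; scanned outputs need one mux LE
        # per bit; control_les covers the disable AND gate and the enable OR gate.
        return base_les + shadow_bits + scanned_output_bits + control_les

    mult   = partial_scan_les(630, shadow_bits=2 * 16, scanned_output_bits=16,     control_les=2)
    cordic = partial_scan_les(812, shadow_bits=3 * 16, scanned_output_bits=2 * 16, control_les=2)
    print(mult, cordic)   # 680 and 894, matching the estimates in Table 4.4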
Table 4.5: Area and Speed of Sample Designs

  Design           Original
                   FF      BlockRAM   LUT RAM   Total LUT   LE      Speed
                   Count   Count      Count     Count       Count   (MHz)
  Eigenray BF       2216      0          67       1775       2658   14.35
  Low-power BF       738     30        1935      14559      14719    3.70
  CDI               4478     18          40       5738       6675   31.28
  Superquant        4890      0        3658      11806      14087   29.46
Table 4.6: Design-Level Scan Costs for Sample Designs

  With Optimized Full Scan
  Design           FF      FF      LUT     LUT     LE      LE      Speed   Speed
                   Count   Ratio   Count   Ratio   Count   Ratio   (MHz)   Ratio
  Eigenray BF       2222   1.00     3413   1.92     3445   1.30     9.62   0.67
  Low-power BF      1307   1.77    24245   1.67    24391   1.66     N/A    N/A
  CDI               5455   1.22    12812   2.23    13434   2.01    25.26   0.81
  Superquant        4896   1.00    32192   2.73    32192   2.29     N/A    N/A
  averages                 1.25            2.14            1.82            0.74

4.1.3
Scan for Large Designs
Up to this point, the examples in this chapter have dealt only with small JHDL
library modules. To determine the cost of scan for complete designs, several large JHDL
designs available at BYU were instrumented for scan. The area and speed costs of the
original designs are shown in Table 4.5 while the cost of instrumenting these designs with
optimized full scan are shown in Table 4.6. The BlockRAM count shown in the first table
is for the fully synchronous RAMs used in the designs, and the LUT RAM count is for the
RAMs with asynchronous writes and synchronous reads, as described in Chapter 3.
The first design in the tables, Eigenray BF, is a sonar beamformer that does matched
field processing. It is heavily pipelined and has a significant data path consisting of 2
CORDIC units, 5 multipliers, and other logic. Since it is the only XC4000 design in the
table (the rest are Virtex), to keep the comparisons of the different designs consistent, only
the 4-LUTs are reported in the table (the H-LUTs for the XC4000 design are ignored). In
addition, the 4-LUTs are what dominate the area of the design and are comparable to LEs in
understanding the area overhead of scan. The second design, Low-power BF, is described
in [11]. It includes a 1024-point FFT unit and an acoustic beamformer, and is similar to the
Eigenray BF in that it has many datapath modules (CORDICs, multipliers, etc.). However,
due to power constraints, this design is not pipelined at all. The final two designs are
related to automatic target recognition and differ from the first two since they are more
control intensive rather than data path intensive. CDI is a form of the design reported in
[12] whose function is to perform histogramming and peak finding. The Superquant design
performs adaptive image quantization to optimally segment images for target recognition.
All four of these designs factor in the cost of instrumenting RAMs in addition to FFs; as
shown in Table 4.5, Low-power BF and CDI use a significant number of BlockRAMs (only
32 are on a chip) and Low-power BF and Superquant use a great deal of LUT RAMs of
various types (32x1, 16x1, dual-ported 16x1, etc.).
For Eigenray BF, note in Table 4.6 that, although the LUT count nearly doubles when it is instrumented for scan, the LE overhead is only 30%. This is because, in the original design, the LE count is dominated by flip-flops. Instrumenting the design nearly doubles the number of LUTs required, but many of those LUTs were able to be absorbed into LEs which previously contained only flip-flops. In addition to the cost of instrumenting FF scan logic, some of the other overhead comes from instrumenting the LUT RAMs in the design. Recall from Section 3.1.1 that each LUT RAM requires additional LUTs for multiplexing the data and the address inputs, an OR gate for the write enable, and some logic for the address generator. For a standard 16x1 LUT RAM, this results in a cost of 9 LUTs for the address generator (which is paid only once since it is shared by all the LUT RAMs) and 6 LUTs for the multiplexors and OR gate for each 16x1 LUT RAM. Since this design uses relatively few LUT RAMs, this overhead is fairly small. The rest of the overhead can be explained by Section 4.1.1 in that some extra LUTs are required for route-through. In addition, some of the scan logic is packed into already existing 4-LUTs, which actually reduces the overhead of scan. Finally, as can be seen from the FF counts for this design in Table 4.6, there is a small overhead of 6 FFs required to instrument the address generator for the LUT RAMs. Compared to the number of FFs used by the original circuit,
this number is negligible.
The analysis of some of the other designs in the table is even trickier since they use synchronous BlockRAMs. Before the analysis can be done, consider the LUT and FF overhead of instrumenting BlockRAMs for scan as shown in Figures 4.12 and 4.13, respectively. The bottom curve on each graph shows the cost of instrumenting a single-ported BlockRAM (which is instrumented by first converting it to a dual-ported BlockRAM).
Figure 4.12: LUT Scan Overhead for Instrumenting a Single Virtex BlockRAM (4-LUT overhead vs. data width for the single-ported and dual-ported lower-/upper-bound cases)
Figure 4.13: Flip-Flop Scan Overhead for Instrumenting a Single Virtex BlockRAM (flip-flop overhead vs. data width for the same three cases)
As the data width of the BlockRAM increases, so does the number of 4-LUTs and FFs needed to instrument it, in the form of extra feedback multiplexors and larger converters for the inputs and outputs, as explained in Section 3.1.1. The middle curve is the lower-bound curve for the dual-ported BlockRAMs, and represents the case where both data ports are of the specified width. The top curve, or upper-bound curve for the dual-ported BlockRAMs, represents the case when data port A of the BlockRAM is of the specified width, while data port B is at the maximum width of 16. The overhead required to instrument a dual-ported BlockRAM thus falls somewhere between these two bounds. Note that these graphs show the cost of instrumenting a single BlockRAM; some of this overhead is in the form of control logic and can be shared if multiple BlockRAMs are instrumented. (Approximately 25 LUTs and 13 FFs are shared by multiple BlockRAMs.) Nevertheless, the cost of instrumenting BlockRAMs is very significant.
With this in mind, consider the next design in Tables 4.5 and 4.6—Low-power BF. As can be seen in Table 4.5, the LUT count and LE count for the design are almost the same, so the LE count is dominated by LUTs as opposed to FFs. Thus, not much of the scan logic can be packed into partially filled LEs as it was for Eigenray BF, which is reflected by the similar 67% LUT overhead and 66% LE overhead. A relatively small portion of this overhead comes from the FFs, since the number of FFs is small compared to the number of LUTs in the original design. The majority of the overhead comes from the LUT RAMs and BlockRAMs found in the design. The design uses 30 dual-ported BlockRAMs with 4-bit wide data ports (ports A and B). Figure 4.12 shows the scan overhead per BlockRAM to be about 105 LUTs, although approximately 25 of those LUTs can be shared by all the BlockRAMs, leaving an overhead of about 80 LUTs per BlockRAM. So the LUT overhead for the 30 BlockRAMs is about 25 + (30 × 80) = 2,425 LUTs, or approximately 17% of the original LUT count. Most of the rest of the overhead for this design, however, is caused by the relatively high number of LUT RAMs used in the design. At first glance it would appear that the overhead for the LUT RAMs is about 11,600 LUTs (1935 × 6), since a 16x1 LUT RAM typically requires 6 LUTs to instrument for scan. However, most of these LUT RAMs were combined in the design to form larger RAMs; hence, much of the scan logic can be shared.
The FF overhead for this design is fairly high due to the use of 30 BlockRAMs in the circuit. As can be seen in Figure 4.13, the FF overhead is approximately 31 FFs per BlockRAM. About 13 of these FFs can be shared by all the BlockRAMs, so about 18 FFs are actually used by each BlockRAM. So the FF overhead is about 13 + (30 × 18) FFs, plus the FFs for the address generator, which results in a 77% FF overhead. Fortunately, in this design, most of these extra FFs are packed into already existing LEs.
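The arithmetic above can be summarized as a shared cost plus a per-BlockRAM cost. The sketch below reproduces it; the per-BlockRAM and shared figures are read approximately off Figures 4.12 and 4.13, so the results are only rough estimates.

    # Rough reconstruction of the BlockRAM scan-overhead arithmetic above.
    def blockram_scan_cost(num_brams, per_bram, shared):
        # 'per_bram' is the cost of instrumenting one BlockRAM in isolation;
        # 'shared' is the portion of that cost paid only once.
        return shared + num_brams * (per_bram - shared)

    luts = blockram_scan_cost(num_brams=30, per_bram=105, shared=25)
    ffs  = blockram_scan_cost(num_brams=30, per_bram=31,  shared=13)
    print(luts, ffs)                # roughly 2,425 LUTs and 553 FFs of BlockRAM overhead
    print(round(luts / 14559, 2))   # about 0.17 of the original LUT count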
CDI can be analyzed similarly. With CDI, the design has a large number of FFs relative to LUT RAMs and BlockRAMs, so most of the scan overhead comes from instrumenting the FFs. Although the exact overhead numbers for the FFs are unknown, just the scan multiplexors for the FFs alone require an increase of about 4,500 LUTs. The second largest source of overhead comes from the use of 18 dual-ported BlockRAMs with 16-bit data ports in the design. According to Figures 4.12 and 4.13, this translates into a cost of about 145 LUTs and 80 FFs per BlockRAM (minus the overhead that is shared by the BlockRAMs). The LUT RAMs, however, do not have much of an impact with scan in this circuit, as there are only 40 LUT RAMs in the entire design. CDI has a high area overhead, with a 123% increase in 4-LUTs, or a 101% increase in LEs.
Although Superquant does not use any BlockRAMs, it does use a large number of FFs and LUT RAMs in the design, giving it the highest overhead of all four of the designs in the table. Although the FFs contribute greatly to the scan overhead, the greatest amount of overhead is contributed by the LUT RAMs. This overhead is mostly due to the problem discussed in Section 3.1.4, where the module used to group primitive LUT RAMs to form a larger RAM is not found on the scan instrumenter's list of commonly used modules. As such, the scan instrumenter has no way of knowing which LUT RAMs share common addressing logic, so each LUT RAM incurs the full penalty of 6 LUTs when instrumented for scan. The price paid is a 173% LUT increase, or a 129% LE increase.
Table 4.6 also provides the speed penalty incurred by scan for these designs. For Eigenray BF and CDI, the average circuit speed after instrumenting scan is only 74% of the original speed. For Low-power BF and Superquant, no speed data is given for scan. This is because instrumenting scan actually caused these circuits to be too large to fit in the FPGA!
This is clearly a problem; some possible solutions include using a larger FPGA for debug
(provided it has the same pin out as the smaller FPGA), implementing better optimization
techniques such as the first of the resource sharing techniques described in Section 3.1.4,
or using partial scan techniques.
Partial scan was described in Section 4.1.2 and estimated results were given for the
multiplier and CORDIC unit. To give an example of how partial scan might help for a larger
circuit, consider the Eigenray BF case, which contains, among other logic, 2 CORDIC units
and 5 array multipliers of varying sizes. To estimate the cost of using partial scan, it is
assumed that the circuit is instrumented for full scan with the exception of the multipliers
and CORDICs, which are partially scanned as described in Section 4.1.2. By actually
instrumenting the Eigenray BF with full scan without instrumenting the CORDICs and
multipliers, and then adding to the resulting LE count the estimated partial scan costs of the
CORDICs and multipliers, it is estimated that a partially scanned Eigenray BF would cost
C
D
G>_`G K LEs, which is a IPL
overhead over the original non-scan design—an improvement
CD
over the G
overhead required for full scan. Though not conclusive, these results are
encouraging and suggest that partially scanning library elements can reduce scan overhead
while giving up little visibility.
To sum up these results, it is clear that LE counts are a more accurate method
of determining the actual area overhead for instrumenting scan than LUT counts because
many LUTs added by scan can be packed into partially filled LEs. However, LUT counts
can be useful to get an idea as to how much logic was actually instrumented for scan.
Also, RAMs are extremely costly to instrument for scan—a dozen or more BlockRAMs
can literally require thousands of extra LUTs to instrument; LUT RAMs are also very
expensive, especially when they are not optimized for scan and pay the full penalty of scan.
In addition, scan slows down the speed of the circuit when running in normal operation.
This can be a problem for circuits required to run at or above certain frequencies. Lastly,
it is clear that scan can make large circuits too big to fit on the FPGA. To get around this,
either a larger FPGA needs to be utilized or some techniques must be used to decrease the
overhead of scan.
4.1.4
Packing Scan Logic into Existing LUTs
As was previously illustrated in Figure 4.9, one effect of instrumenting scan is the
packing of instrumentation logic into already existing 4-LUTs used by the original circuit.
If most of the additional scan logic gets packed into already existing LUTs, there won’t be
much area overhead. Unfortunately, Table 4.7 shows that little of the scan logic actually
gets packed into already existing LUTs. The table shows for a variety of Virtex designs after
they have been instrumented with scan: (1) the percentage of LUTs containing the original
user logic (including user LUTs that have been packed with scan logic), (2) the percentage
of LUTs containing only scan circuitry—i.e., LUTs purely containing scan overhead, and
(3) the percentage of the total scan logic that was packed into already existing LUTs—
that is, scan logic that does not affect the area of the circuit in any way. As can be seen
from the table, some designs, such as Cnt and Cordic, do not have any scan logic packed
into existing LUTs. This means that these designs pay the full LUT overhead for scan. CDI, on the other hand, has over 13% of its scan logic packed into already existing LUTs, which reduces the actual scan overhead for the design. The average for these designs is to have only 5% of the scan logic packed into already existing LUTs. Although this number
does not take into account the scan LUTs that do not take up any extra area since they are
packed into partially filled LEs, this table shows that user designs come close to paying the
full overhead for scan, with very little of the scan logic being packed into already existing
4-LUTs.
4.1.5
Using Dedicated Scan Multiplexors
Due to the heavy use of flip-flops in many designs, much of the overhead from
instrumenting scan chains comes from adding the scan multiplexors to the flip-flops, as described in Section 3.1.1. One way to reduce this overhead would be for the FPGA vendor
to provide dedicated scan multiplexors for each flip-flop. For example, in the Xilinx Virtex
technology, some dedicated multiplexors exist, such as the library primitives muxf5, muxf6,
and muxcy. These multiplexors add additional logic to a Virtex slice or CLB beyond that
provided by the 4-LUTs: the muxcy is used for carry logic and the muxf5 and muxf6 are used
to provide additional logic, such as 4:1 and 8:1 multiplexors, without consuming extra LUT
Table 4.7: Logic Packing for Sample Designs

  Design           % LUTs Containing   % LUTs Containing   % of Scan Logic
                   User Logic          Only Scan Logic     Packed Into User LUTs
  Cnt                  44.4%               55.6%                 0.00%
  Mult                 29.8%               70.2%                 5.88%
  Cordic               48.4%               51.6%                 0.00%
  Low-power BF         59.5%               40.5%                 1.75%
  CDI                  44.6%               55.4%                13.22%
  Superquant           34.9%               65.1%                 9.60%
  averages             43.6%               56.4%                 5.08%
resources. If Xilinx and other FPGA vendors added a dedicated primitive such as muxscan, this multiplexor could be used instead of 4-LUTs to provide the scan muxes for FFs
and RAMs without consuming valuable LUT resources. Table 4.8 estimates that the LUT
and LE overheads would be significantly reduced by using a dedicated scan multiplexor.
Since such a multiplexor does not exist, the results in the right-hand column are based on
instrumenting each design with full scan, with the exception that it does not include the
scan multiplexor with each FF. The assumption is that a dedicated scan multiplexor will
be used instead, which will not use up any extra LUTs to instrument the multiplexor. It
shows that the average LUT overhead associated with scan is reduced from 114% to 65%, and the LE overhead is reduced from 82% to 58%. Note that this optimization works best
in designs that use many FFs; for designs with relatively few FFs, such as Low-power BF,
the difference is minimal.
Another similar approach is to use already existing dedicated multiplexors to instrument scan. The Xilinx Virtex primitives muxf5 and muxf6 can potentially be used for
this purpose. The key is to use one of these primitives as the scan multiplexor if it is unused in the particular slice or CLB by the original design; otherwise, a regular 4-LUT must
be used. This will potentially reduce the number of 4-LUTs consumed in the process of
instrumenting scan. However, this implementation does have several drawbacks. First, it is
very difficult to implement—since scan instrumentation logic is added prior to netlisting,
there is no way of knowing which slices and CLBs the muxf5s and muxf6s will be mapped to.
Table 4.8: Design-Level Scan Costs Using a Dedicated Scan Mux

  Design           Normal             Full Scan                        Full Scan w/ scan-mux
                   LUT     LE         LUT     LUT     LE      LE       LUT     LUT     LE      LE
                   Count   Count      Count   Ratio   Count   Ratio    Count   Ratio   Count   Ratio
  Eigenray BF       1775    2658       3413   1.92     3445   1.30      1997   1.13     2855   1.07
  Low-power BF     14559   14719      24245   1.67    24391   1.66     23722   1.63    24173   1.64
  CDI               5738    6675      12812   2.23    13434   2.01      8704   1.52    10417   1.56
  Superquant       11806   14087      32192   2.73    32192   2.29     27423   2.32    28642   2.03
  averages                                    2.14            1.82             1.65            1.58
To illustrate, if a muxf5 is used as the scan multiplexor of a FF in a slice where the muxf5 is
already in use, the muxf5 used for scan will have to be placed in an entirely different slice,
which adds an entire additional slice to the overhead of scan instead of only an additional
4-LUT for the scan multiplexor. Second, since a slice can have two FFs but only one muxf5,
if both FFs in the slice are being used, the muxf5 can only provide the scan logic for one
of the FFs. Third, the muxf6 and muxcy have limited access—the muxf6 can only use the
outputs of a muxf5 as its inputs, and the muxcy can essentially only be used as carry logic.
If the FPGA vendor made these multiplexors more accessible for scan use, they could be
used as scan multiplexors in place of 4-LUTs.
Naturally, some tradeoffs exist for implementing either of these two approaches.
For instance, additional silicon area would be required, whether adding dedicated scan
multiplexors to each slice or improving accessibility to existing dedicated muxes. However,
a dedicated scan multiplexor requires far fewer transistors than instancing an entire 4-input
LUT for scan. Also, the dedicated multiplexor can be optimized in silicon to increase the
speed of the circuit when instrumented with a scan chain. However, if the particular design
does not use many flip-flops or if the dedicated scan multiplexors are infrequently used, the
overall extra silicon overhead on the chip may not be worth the cost. In addition, having
a dedicated multiplexor in silicon may cause normal designs that don’t use scan to run at
reduced speeds.
4.1.6
Other Cost Issues
As mentioned earlier in the chapter, full scan for VLSI has reported overheads on
the order of 5–30%. The costs for implementing full scan for FPGAs as reported in this
chapter are far greater than this. An obvious question, then, is why scan costs so much more for FPGAs than it does in VLSI. The answer lies in the granularity of the devices used for implementing scan logic—transistor logic costs much less than FPGA LUT logic [13]. For example, [10] claims that a D flip-flop instrumented for scan is only 10% larger in area.
In an FPGA design, however, a FF uses half of an LE. Thus, instrumenting a FF for scan
effectively doubles its size, since the scan multiplexor for the FF requires a LUT, which
is also half an LE. The size may be tripled or even quadrupled by using additional LUTs
for the clock enable and set/reset scan logic. Section 6.1.2 will discuss scan overheads in
FPGAs versus VLSI a little further.
This chapter has suggested a few techniques such as partial scan and using special
dedicated scan multiplexors to help lower the cost of scan in FPGA devices. Chapter 5
will also discuss some ideas to help alleviate these costs for the purpose of increasing the
observability and controllability of user designs. Another potential solution is to have the
FPGA vendors provide scan-specific primitives. For example, if a FF design primitive existed with extra inputs for ScanEnable, ScanIn, and ScanOut, then when the user design is instrumented for scan, each FF could be replaced by one of these scan FF primitives, already optimized for scan. That way, no extra LUTs would be used to implement scan, and the overhead would be only the small amount of transistor logic used to create the primitive, giving overheads similar to those reported for scan in VLSI.
Chapter 5
Variations of and Alternatives to Scan
5.1 Supplementing Existing Observability and Controllability
Thus far, full scan has been proposed as a method for providing full observability
and controllability to provide complete functional verification on all types of FPGAs. Full
scan is often necessary for providing this capability since many FPGA vendors, such as Altera and Cypress, have neither built-in observability nor controllability features. However,
many FPGAs, such as those produced by Xilinx, Lucent, and Atmel, are equipped with
limited capability to read or modify the state of a circuit. For example, Xilinx XC4000 and Virtex FPGAs can be partially reconfigured at run-time, thus providing controllability of the state of the embedded RAMs. However, although the state of embedded RAMs
is controllable, the state of the flip-flops can only be controlled when the Global Set/Reset
(GSR) is asserted at the beginning of hardware execution; after that, controlling the state of
the flip-flops is impossible. In addition, some FPGAs have the ability to capture the state of
the circuit through readback, as explained in Section 2.1.3, but this feature is also limited.
For example, the state of the output registers on the synchronous BlockRAMs in Xilinx
Virtex FPGAs is modified during readback, which invalidates the state of the circuit.
One method of achieving full circuit observability and controllability without paying the high overhead of full scan is to take advantage of the observability and controllability features that FPGAs already provide. Scan and other related techniques can then be used to
overcome the shortcomings of those features.
This chapter discusses variations and alternatives to full scan that supplement existing FPGA debug features to provide complete controllability and observability of the user
circuit without paying the full cost of full scan. It uses circuit designs on Xilinx Virtex and
XC4000 FPGAs to illustrate. Although this chapter generally addresses controllability issues separately from observability issues, these issues are often related. For example, using
partial reconfiguration to control the circuit state requires the FPGA to be reconfigured on
a frame-by-frame basis. Thus, modifying any portion of the state of an FPGA requires the
entire frame to be read and modified. The sequence of events to modify the FPGA state
is to (1) read the state of the frame being modified, (2) modify the desired portion of the
frame, and then (3) write the frame back into the bitstream. Hence, circuit controllability
is dependent on circuit observability.
5.1.1
Strategies to Increase Controllability
One of the purposes of scan is to provide the ability to bring the circuit to a known
state during debug. Some FPGA vendors, such as Xilinx, already have the ability to externally modify the state of their embedded RAMs—the LUT RAMs and BlockRAMs—
through partial reconfiguration. However, the state of the flip-flops cannot be modified
externally; since FFs are widely used in many designs, this makes the controllability of
Xilinx FPGAs very limited.
One option to provide complete controllability of Xilinx FPGAs is to use the built-in partial-reconfiguration features to control the state of the LUT RAMs and BlockRAMs,
and to use scan to control just the flip-flops. The area overhead for this method consists
of the cost of instrumenting the flip-flops for scan and a minimal amount of extra logic
required to disable all other memories to preserve their state during scan.
Table 5.2 compares the overhead of this approach to the overhead of full scan for the
same large JHDL designs mentioned in Chapter 4. Table 5.1 contains the original flip-flop
count, BlockRAM count, LUT RAM count, LUT count, and LE count for these designs.
The numbers in Table 5.1 can also be found in Table 4.5 from the previous chapter, and
have been duplicated here for convenience to aid in the discussion.
As can be seen in Table 5.2, the reduction in LUT overhead for Eigenray BF when only the FFs are instrumented for scan, compared with full scan, is actually very small, going from a 92% LUT overhead down to an 86% LUT overhead.
Table 5.1: Area of Sample Designs

  Design           Original
                   FF      BlockRAM   LUT RAM   Total LUT   LE      Speed
                   Count   Count      Count     Count       Count   (MHz)
  Eigenray BF       2216      0          67       1775       2658   14.35
  Low-power BF       738     30        1935      14559      14719    3.70
  CDI               4478     18          40       5738       6675   31.28
  Superquant        4890      0        3658      11806      14087   29.46
Table 5.2: Design-Level Scan on Only Flip-Flops

  Design           Full Scan                         Scan only flip-flops
                   LUT     LUT     LE      LE        LUT     LUT     LE      LE
                   Count   Ratio   Count   Ratio     Count   Ratio   Count   Ratio
  Eigenray BF       3413   1.92     3445   1.30       3306   1.86     3427   1.29
  Low-power BF     24245   1.67    24391   1.66      16035   1.10    16035   1.09
  CDI              12812   2.23    13434   2.01      10945   1.91    10945   1.64
  Superquant       32192   2.73    32192   2.29      20342   1.72    20342   1.44
  averages                 2.14            1.82              1.65            1.37
The LE overhead also went down only slightly, from 30% to 29%. The reason for the small improvement is that most of the memory elements in the design are FFs; there are relatively few RAMs in the design, as is shown in Table 5.1. Thus, instrumenting only the FFs for scan in this design results in almost the same circuitry as instrumenting the design for full scan, so the overheads are about the same.
CDI tells a similar story, with the LE overhead of scan going from 101% down to 64%. This design is also dominated by FFs rather than embedded RAMs, so much of the
design must be instrumented for scan. This is still a greater reduction in overhead than for
the Eigenray BF case, though, due to CDI’s heavy use of BlockRAMs. Chapter 4 showed
the costs of instrumenting BlockRAMs for scan to be extremely high, so not including them
in the scan chain achieves a significant improvement in area overhead.
Low-power BF and Superquant both have the greatest savings in overhead when scanning only the FFs, with the LE overhead for Low-power BF going from 66% down to 9%, and the LE overhead for Superquant going from 129% down to 44%. Both of these designs have a large number of embedded RAMs, so not instrumenting these RAMs for scan leads to a great savings in area overhead. In addition, since Low-power BF does not use very many FFs in its design, the new overhead for the design is relatively small indeed—only 9%.
By instrumenting only the FFs for scan in these designs, the average LE overhead was reduced from 82% to 37%—a significant savings. The price of the three pins being used for scan control is still paid, but this partial scan approach is certainly a far more
reasonable solution than performing a full scan. When vendors fail to provide complete
circuit controllability, scan may be the only method available for controlling the state of all
the memory elements.
5.1.2
Strategies to Increase Observability
In addition to providing some controllability features, Xilinx FPGAs provide means
of observing the circuit state through readback, as discussed in Chapter 2. One of the
problems with readback is that the state of the output registers in Virtex BlockRAMs is
altered whenever a readback is performed, thus altering the circuit state. One solution to
this problem is to use scan to read the state of the BlockRAMs, and to use readback to read
the state of all the other memory elements in the design. Some extra logic will be necessary
to disable these other memory elements while scan is being performed on the BlockRAMs
to preserve their state. Unfortunately, the problem with this method is the high LUT and
FF overhead incurred by scan for BlockRAMs, as shown previously in Figures 4.12 and
BC
C
FFs are needed
4.13 in Chapter 4. These figures show that as many as E&J LUTs and
to instrument each BlockRAM in the design for scan. Hence, an alternative solution is
definitely preferrable.
An alternate approach to this BlockRAM readback problem is to add shadow registers with control logic to the output registers of the BlockRAMs to save their state during a
readback, as shown in Figure 5.1. To gain an understanding of how this extra logic works,
consider the normal sequence of events when a readback is performed: (1) the clock edge
arrives, (2) the circuit settles, (3) readback is performed. However, since the state of the
Figure 5.1: Logic to Fix Readback Problem with Virtex BlockRAMs
BlockRAM output registers is modified during a readback, it will propagate incorrect values to the rest of the circuit as soon as the next clock edge arrives.
Using the logic shown in Figure 5.1, here is the new sequence of events: (1) the
clock edge arrives, (2) the circuit settles, (3) the control signal PrepRB goes high and then
immediately low again, acting like the rising and falling edge of a clock for the bank of
flip-flops labeled F3. When this occurs, F3 now contains a copy of the current state of the
BlockRAM output registers. In addition, the flip-flop labeled F1 is asynchronously reset
by PrepRB so that the multiplexor selects F3 as the Dout output during that cycle. Finally,
(4) readback is performed, but since the Dout seen by the rest of the circuit is really the
contents of F3, it doesn’t matter that the contents of the BlockRAM output registers have
been modified. This process repeats for each cycle that a readback is performed.
A few more details about this circuit are in order. During normal BlockRAM operation, the output registers are refreshed each clock cycle. Thus, F3 always gets valid
data even if a readback was performed the previous cycle. However, whenever the BlockRAM is disabled (the enable signal is low), the output of the BlockRAM does not change.
This means that after a readback is performed, the output register isn’t refreshed with valid
data until the enable goes high again. So in Figure 5.1, the flip-flop F2 ensures that F3 is
only loaded when the BlockRAM presents valid data on its output registers, and F1 always
selects F3 as the Dout signal whenever the Dout from the BlockRAM output registers is
invalid.
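The following behavioral sketch (Python, for illustration only) loosely models the F1/F2/F3 logic of Figure 5.1 and shows that the rest of the circuit still sees valid data after a readback corrupts the BlockRAM output register. It is an approximation of the circuit's behavior, not a gate-accurate model.

    # Behavioral sketch of the Figure 5.1 readback fix.
    class ReadbackSafeOutput:
        def __init__(self):
            self.bram_out = 0          # BlockRAM output register (corrupted by readback)
            self.f3 = 0                # shadow copy
            self.use_shadow = False    # models F1 (selects F3 when set)

        def clock(self, enable, new_data):
            if enable:                     # BlockRAM refreshes its output register
                self.bram_out = new_data
                self.use_shadow = False    # real output is valid again
            return self.f3 if self.use_shadow else self.bram_out

        def prep_readback(self):
            # PrepRB pulse: capture the current (valid) output into F3, then select it.
            if not self.use_shadow:
                self.f3 = self.bram_out
            self.use_shadow = True

        def readback(self):
            self.bram_out = 0xDEAD         # readback clobbers the output register

    ram = ReadbackSafeOutput()
    out = ram.clock(enable=1, new_data=42)
    ram.prep_readback()
    ram.readback()
    print(out, ram.clock(enable=0, new_data=0))   # 42 42: the circuit still sees 42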
As can be seen in the figure, the overhead of instrumenting the shadow registers for a BlockRAM with an output of size n is 2 flip-flops and 2 4-LUTs for control, plus n flip-flops and n 4-LUTs for the shadow registers, for a total overhead of n + 2 flip-flops and n + 2 4-LUTs for each port of the BlockRAM. However, if the Enable is shared by different BlockRAMs or even by both ports of a dual-ported BlockRAM, the control logic may be shared, thus resulting in a cost of n flip-flops and n 4-LUTs for each additional BlockRAM or BlockRAM port with a common Enable.
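Applying this formula gives the worst-case costs used in the analysis below; the port counts and data widths are taken from Table 5.1, and the worst case assumes no control logic is shared between ports.

    # Worst-case overhead of the readback-fix logic: each BlockRAM port of data
    # width n costs n shadow flip-flops/4-LUTs plus 2 for control.
    def readback_fix_cost(ports_per_bram, width, num_brams):
        per_bram = ports_per_bram * (width + 2)
        return per_bram * num_brams

    print(readback_fix_cost(2, 4, 30))    # Low-power BF: 360 FFs and 4-LUTs worst case
    print(readback_fix_cost(2, 16, 18))   # CDI: 648 FFs and 4-LUTs worst case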
Table 5.3: Cost of Repairing BlockRAM Readback

  Design           Normal                       With BlockRAM Readback Logic
                   FF      LUT     LE           FF      FF      LUT     LUT     LE      LE
                   Count   Count   Count        Count   Ratio   Count   Ratio   Count   Ratio
  Low-power BF      738   14559   14719         1078   1.46    14809   1.01    15231   1.03
  CDI              4478    5738    6675         4880   1.09     6065   1.06     7368   1.10
  averages                                              1.28            1.04            1.07
Table 5.3 shows the cost of instrumenting this readback logic for Low-power BF and CDI. Low-power BF contains 30 4-bit wide dual-ported BlockRAMs, so the worst-case overhead of adding this readback logic for each BlockRAM would be 4 FFs and 4-LUTs for the control and 8 FFs and 4-LUTs for the shadow registers (since the BlockRAMs are dual-ported, they require twice as much logic as single-ported BlockRAMs). Multiply this number by 30 and the result is a worst-case overhead of 360 FFs and 4-LUTs for Low-power BF. In reality, the overhead is less than this for two reasons: (1) some of the control logic can be shared due to common enables and (2) some of the BlockRAM outputs are loadless, so some of this extra logic gets optimized away by the vendor's back-end tools. The result is an overhead of less than 1% for LUTs and 3% for LEs. The FF overhead seems high at 46% due to the relatively few FFs used in the original design.
CDI can be analyzed similarly since it contains 18 16-bit wide dual-ported BlockRAMs. This leads to a worst-case overhead of 4 FFs and 4-LUTs for the control logic and 32 FFs and 4-LUTs for the shadow registers, multiplied by 18 BlockRAMs for a worst-case overhead of 648 FFs and 4-LUTs. Again, some of the control logic is shared and about half of the BlockRAM outputs are loadless, so the actual overhead is much less than this. The LUT and LE overhead for CDI is 6% and 10%, respectively, while the FF overhead is 9%.
These overheads are much smaller than those reported for full-scan; in addition, this
method only uses one additional pin for the PrepRB signal, as opposed to four additional
pins for performing full scan on BlockRAMs. Circuit designs can be instrumented with
this readback logic in the same manner as they were instrumented for scan, as described in
Section 3.1.3.
From these examples we see that scan is fully capable of solving observability and
controllability issues when no other alternatives exist. However, when cheaper alternatives
can be found, they should be used instead of scan.
5.2 Summary of Results
Chapter 4 enumerated the costs of implementing full scan to provide complete observability and controllability of user designs in FPGAs. This chapter has shown how it is
possible to apply techniques related to scan to supplement existing FPGA debug capabilities to provide complete observability and controllability at a lower cost. The tables that
follow show a summary of these results.
Table 5.4 provides a summary of the original area costs for all of the JHDL designs
mentioned in Chapter 4. These designs include the three library modules: cnt, mult, and
cordic, as well as the four large JHDL designs: Eigenray BF, Low-power BF, CDI, and
Superquant.
Table 5.5 shows the overhead of instrumenting these designs with full scan to provide complete observability and controllability of the designs. The methodology of instrumenting these designs was discussed in Chapter 4. As can be seen from the table, the average LE area overhead for instrumenting these designs is 84%, which nearly doubles the size of the original designs.
Table 5.4: Original Area of User Designs

  Design           FPGA Type   Original
                               FF      LUT     LE
                               Count   Count   Count
  cnt              Virtex         4       4       4
  mult             Virtex       615     270     630
  cordic           Virtex       768     780     812
  Eigenray BF      XC4000      2216    1775    2658
  Low-power BF     Virtex       738   14559   14719
  CDI              Virtex      4478    5738    6675
  Superquant       Virtex      4890   11806   14087
Table 5.5: Area of User Designs w/ Full-Scan

  Design           Full Scan
                   FF      FF      LUT     LUT     LE      LE
                   Count   Ratio   Count   Ratio   Count   Ratio
  cnt                 4    1.00        9   2.25        9   2.25
  mult              615    1.00      871   3.23      871   1.38
  cordic            768    1.00     1596   2.05     1596   1.97
  Eigenray BF      2222    1.00     3413   1.92     3445   1.30
  Low-power BF     1307    1.77    24245   1.67    24391   1.66
  CDI              5455    1.22    12812   2.23    13434   2.01
  Superquant       4896    1.00    32192   2.73    32192   2.29
  averages                 1.14             2.30            1.84
Table 5.6: Best-Case Results for Improving Observability

  Design           Using BlockRAM Readback Logic
                   FF      FF      LUT     LUT     LE      LE
                   Count   Ratio   Count   Ratio   Count   Ratio
  cnt                 4    1.00        4   1.00        4   1.00
  mult              615    1.00      270   1.00      630   1.00
  cordic            768    1.00      780   1.00      812   1.00
  Eigenray BF      2216    1.00     1775   1.00     2658   1.00
  Low-power BF     1078    1.46    14809   1.01    15231   1.03
  CDI              4880    1.09     6065   1.06     7368   1.10
  Superquant       4890    1.00    11806   1.00    14087   1.00
  averages                 1.08             1.01            1.02
Table 5.7: Best-Case Results for Improving Controllability

  Design           Scanning Only FFs
                   FF      FF      LUT     LUT     LE      LE
                   Count   Ratio   Count   Ratio   Count   Ratio
  cnt                 4    1.00        9   2.25        9   2.25
  mult              615    1.00      871   3.23      871   1.38
  cordic            768    1.00     1596   2.05     1596   1.97
  Eigenray BF      2216    1.00     3306   1.86     3427   1.29
  Low-power BF      738    1.00    16035   1.10    16035   1.09
  CDI              4478    1.00    10945   1.91    10945   1.64
  Superquant       4890    1.00    20342   1.72    20342   1.44
  averages                 1.00             2.02            1.58
Table 5.8: Best-Case Results for Improving Observability and Controllability

  Design           BlockRAM Readback Logic/Scanning Only FFs
                   FF      FF      LUT     LUT     LE      LE
                   Count   Ratio   Count   Ratio   Count   Ratio
  cnt                 4    1.00        9   2.25        9   2.25
  mult              615    1.00      871   3.23      871   1.38
  cordic            768    1.00     1596   2.05     1596   1.97
  Eigenray BF      2216    1.00     3306   1.86     3427   1.29
  Low-power BF     1078    1.46    16362   1.12    16584   1.13
  CDI              4880    1.09    11371   1.98    11679   1.75
  Superquant       4890    1.00    20342   1.72    20342   1.44
  averages                 1.08             2.03            1.60
Sometimes, circuit designers are not nearly as interested in being able to modify
their designs as they are in simply being able to completely observe the circuit state. Thus,
Table 5.6 shows the best results in overhead that can be achieved in Xilinx Virtex and
XC4000 FPGAs when using readback to only observe the circuit state. Since readback’s
limitation to observing circuit state is its modification of the BlockRAM output registers
during the readback process, the table reflects the cost of instrumenting the extra readback
logic discussed in Section 5.1.2. Four of the designs are Virtex designs that do not use
BlockRAMs, and one of the designs is an XC4000 design, which does not support the use
of BlockRAMs; these designs do not incur any area overhead by using readback to obtain
the circuit state. Only two designs shown in the table contained BlockRAMs: Low-power
D
CD
and E
LE overhead for instrumenting the extra
BF and CDI. These designs carry a G
readback logic, respectively.
Table 5.7 shows the best-case overhead that can be accomplished when the user desires controllability, but not observability, of the circuit design. These results were obtained
based on the premise described in Section 5.1.1 that partial reconfiguration can be used to
modify the state of embedded memories, but scan must be used to obtain the state of the FFs
in the design. Hence, the overhead in this table is the overhead of instrumenting only the
FFs in the design for scan, while adding logic to disable all other memories during scan to
preserve their state. Some of these designs contain only FFs with no embedded memories;
in such cases, the overheads are the same as for full-scan since all the FFs must be scanned
to control their state. For the rest of the designs, the overhead decreases substantially since
the embedded memories in the designs do not require the full scan logic. On average, the cost of providing complete controllability for these designs is a 58% LE overhead.
Finally, Table 5.8 contains the results of combining the extra BlockRAM readback logic with scanning only the user FFs to provide complete observability and controllability of all the designs. It shows an average LE overhead of 60% to instrument these designs. Although this is certainly an improvement over the 84% average overhead of full scan for the same designs, it shows that since FPGA vendors currently do not provide full observability and controllability features on their FPGAs, the cost of attaining such capabilities for debug is very high.
Chapter 6
Other Scan Issues
6.1 Overview
This chapter ties up some other loose ends pertaining to scan. It begins by discussing
some of the FPGA system-level issues associated with scan. For instance, after scan has
been performed on a user design, not only must the circuit state be left unaltered, but the
state of the external memories, FIFOs, etc. must also remain unaltered by scan. In addition,
any reads and writes being performed to external memories must also be unaffected by scan.
Next comes a discussion of why scan area costs in FPGAs are so high compared with VLSI, and what kind of area improvements could be made by FPGA vendors adding built-in scan functionality to their FPGAs. Lastly, this chapter shows how logic can be implemented
in a similar manner to scan to stop the user clock during readback.
6.1.1
FPGA System-Level Issues
When implementing design-level scan in FPGAs, care must be taken so that the
user design is not modified by the scan process. For example, while the state bits are
being scanned out and back into the circuit, both combinational and synchronous values
are switching all the time. If one of these signals drives the reset of a FF and goes high
during the scan process, the FF is then reset and the circuit is unintentionally modified by
scan. For user designs, this problem was solved by using an AND gate to disable all sets,
resets, and other such signals, as was shown in Figure 3.3.
Unfortunately, this solution only works to prevent the user design contained on
the FPGA itself from being modified. Frequently, signals contained in the FPGA design
control reads and writes to external memories on an FPGA system. This could result in
undesired writes to the external memories when the circuit is operating in scan mode. An
easily implemented solution, then, is to tri-state the I/O pins connected to the write enables
on the external memories during scan. Since the memory write enables are active low, these
same I/O pins must also be connected to weak pull-ups to disable writing to the external
memories during scan.
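As a rough illustration of this pad behavior, the sketch below models an active-low external write enable that is tri-stated during scan and held high by a weak pull-up; the function and signal names are assumptions for illustration only.

# Model of an active-low write-enable pad during scan (illustrative).
# When scan_enable is high the output driver is tri-stated and the weak
# pull-up holds the external write enable high (inactive), so no external
# memory writes can occur while the design is being scanned.

def external_write_enable_n(user_we_n: int, scan_enable: bool) -> int:
    if scan_enable:
        return 1          # driver tri-stated; the weak pull-up wins
    return user_we_n

assert external_write_enable_n(0, scan_enable=True) == 1    # write blocked during scan
assert external_write_enable_n(0, scan_enable=False) == 0   # normal write proceeds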
Another system-level issue involves handling reads and writes to external memories
that have begun, but have not yet completed when scan first begins. An easy solution is
to buffer the data being read so that it can be used after scan, and to buffer the data being
written to ensure the correct state is still written to the memories.
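One possible realization of such buffering is sketched below: a write that is in flight when scan begins is held and replayed once scan completes. The class and method names are hypothetical and only illustrate the idea.

# Illustrative sketch: hold an in-flight external-memory write across a scan
# operation and replay it afterward, so the memory still ends up with the
# correct data.

class WriteBuffer:
    def __init__(self, memory):
        self.memory = memory
        self.pending = None                 # (address, data) deferred by scan

    def write(self, addr, data, scanning):
        if scanning:
            self.pending = (addr, data)     # defer until scan completes
        else:
            self.memory[addr] = data

    def scan_done(self):
        if self.pending is not None:
            addr, data = self.pending
            self.memory[addr] = data        # replay the deferred write
            self.pending = None

mem = {}
buf = WriteBuffer(mem)
buf.write(0x10, 0xAB, scanning=True)   # write arrives just as scan starts
buf.scan_done()                        # after scan: mem[0x10] == 0xAB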
6.1.2 Scan Overhead in FPGAs vs. VLSI
As was mentioned in Section 4.1.6, various reasons exist as to why scan is so much
more expensive in FPGAs than it is in VLSI. One of the main area considerations is the size
of a 4-input LUT versus the size of a standard D flip-flop with clock enable and set/reset.
According to [14], the transistor area of such a FF is about 18 transistors, whereas the area
for a 4-LUT is about 167 transistors. If the logic required to scan a FF is a multiplexor and
two logic gates, as shown previously in Figure 3.3, this requires three LUTs of overhead,
or approximately 501 transistors! The same scan logic could be implemented in VLSI for
approximately 16 transistors.
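The arithmetic behind these figures can be checked directly from the transistor counts quoted above from [14]; the short Python calculation below only restates those numbers and is not an additional measurement.

# Worked arithmetic for the area comparison: ~18 transistors per D flip-flop
# with clock enable and set/reset, ~167 transistors per 4-input LUT, and
# ~16 transistors for equivalent scan logic in VLSI.

FF_TRANSISTORS = 18
LUT4_TRANSISTORS = 167
VLSI_SCAN_TRANSISTORS = 16
LUTS_PER_SCANNED_FF = 3                   # multiplexor plus two gates (Figure 3.3)

fpga_scan = LUTS_PER_SCANNED_FF * LUT4_TRANSISTORS
print(fpga_scan)                          # 501 transistors in LUT logic
print(fpga_scan / VLSI_SCAN_TRANSISTORS)  # roughly 31x the VLSI scan logic
print(fpga_scan / FF_TRANSISTORS)         # roughly 28x the flip-flop itself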
Thus, one approach that FPGA vendors can take is to add built-in scan logic to each of the flip-flops on an FPGA. Although this would effectively double the area of each flip-flop, this is nothing compared to the G C X cost associated with using LUT logic to provide scan capabilities. Another approach is to provide only a special scan multiplexor, as described in Section 4.1.5. This would cost only an extra L – K transistors per flip-flop, but it could significantly reduce the cost of instrumenting scan. Naturally, another
area overhead includes the extra routing required for scan, but obtaining such numbers is
beyond the scope of this work.
6.1.3 Stopping the Global Clock
Readback has been shown as one method to obtain the state of a user circuit. However, as was discussed in Section 2.1.3, readback requires that the FPGA system be able to stop the user clock, which is not possible in many FPGA-based systems. Using scan to retrieve the circuit state avoids this limitation since the clock keeps running while the circuit operates in scan mode, although the circuit does no useful work during those clock cycles. The main disadvantage of scan, though, is the large area overhead it incurs.
Figure 6.1: Global Clock Enable Circuitry for Flip-Flops
Figure 6.2: Global Clock Enable Circuitry for Flip-Flops without Clock Enables
Table 6.1: Area Overheads for Clock-Stopping Circuitry

                        Normal                AND gate approach                            MUX approach
Design            LUT Count  LE Count   LUT Count  LUT Ratio  LE Count  LE Ratio   LUT Count  LUT Ratio  LE Count  LE Ratio
Eigenray BF            1775      2658        1835       1.03      2694      1.01        3345       1.88      3371      1.27
Low-power BF          14559     14719       15524       1.07     15673      1.06       16104       1.11     16104      1.09
CDI                    5738      6675        6572       1.15      7545      1.13       10426       1.82     10426      1.56
Superquant            11806     14087       21672       1.84     21672      1.54       20065       1.70     20065      1.42
averages                                                 1.27                1.19                   1.63                1.34
An approach similar to instrumenting scan logic into user circuitry can be used to effectively stop the user clock during a readback. Instead of using a ScanEnable signal,
this circuitry uses a Global Clock Enable (GCE) signal. Adding the GCE logic consists of
placing an AND gate in front of the clock enable of each flip-flop and in front of the chip enable or write enable of each embedded RAM, as shown in Figure 6.1. Whenever GCE
is pulled low, all the FFs and embedded RAMs are disabled. If the flip-flop does not have
a clock enable, it must either be replaced by an equivalent FF that does, or else it must be
instrumented as shown in Figure 6.2. As seen in the figure, such FFs use a multiplexor in
front of the D input so that when GCE is low, the FF does not update with new data.
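A minimal behavioral sketch of the two gating styles follows (in Python, with illustrative names); it models only the state-holding behavior, not the actual FPGA primitives.

# Behavioral models of the two clock-stopping styles. In both cases,
# pulling GCE low prevents the flip-flop from changing state.

class FFWithClockEnable:
    """FF that already has a clock enable: AND the enable with GCE (Figure 6.1)."""
    def __init__(self):
        self.q = 0

    def clock(self, d, clk_en, gce):
        if clk_en and gce:                 # AND gate on the existing clock enable
            self.q = d
        return self.q

class FFWithoutClockEnable:
    """FF without a clock enable: 2:1 mux feeds Q back to D when GCE is low (Figure 6.2)."""
    def __init__(self):
        self.q = 0

    def clock(self, d, gce):
        self.q = d if gce else self.q      # mux in front of the D input
        return self.q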
Table 6.1 shows the overhead associated with implementing the AND gate and the
multiplexor approaches to stop the global clock. As can be seen from the table, if all of the
FFs in the design have clock enables, the AND gate approach may be used, which leads
to an average overhead of 27% more LUTs, or 19% more LEs. However, if the FFs do not have clock enables, the overhead is about 63% in terms of LUTs and 34% in terms of
LEs. The overhead is much greater for the multiplexor case because the GCE multiplexor cannot be shared by multiple FFs, whereas the AND gate can be shared among FFs that
share a common enable. Also, the multiplexor logic can require one or more additional
LUTs to route the output of the flip-flop back to the input of the multiplexor, which adds
further overhead.
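The ratios and averages in Table 6.1 follow directly from the LUT and LE counts; the short calculation below, with the table's numbers hard-coded, reproduces them and is included only as a check.

# Recompute the overhead ratios and column averages of Table 6.1 from the raw
# counts: (normal LUTs, normal LEs, AND-gate LUTs, AND-gate LEs, MUX LUTs, MUX LEs).
designs = {
    "Eigenray BF":  (1775,  2658,  1835,  2694,  3345,  3371),
    "Low-power BF": (14559, 14719, 15524, 15673, 16104, 16104),
    "CDI":          (5738,  6675,  6572,  7545,  10426, 10426),
    "Superquant":   (11806, 14087, 21672, 21672, 20065, 20065),
}

ratios = []
for name, (lut, le, and_lut, and_le, mux_lut, mux_le) in designs.items():
    r = (and_lut / lut, and_le / le, mux_lut / lut, mux_le / le)
    ratios.append(r)
    print(name, ["%.2f" % x for x in r])

# Column averages come out to roughly 1.27, 1.19, 1.63, and 1.34, as in the text.
for col in range(4):
    print("%.2f" % (sum(r[col] for r in ratios) / len(ratios)))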
Chapter 7
Conclusions
7.1 Conclusions and Future Work
This work has described the limitations associated with providing complete observability and controllability for functional verification of FPGA-based designs, and has
shown how instrumenting design-level scan overcomes these limitations. For example,
some FPGAs provide built-in tools, such as Xilinx’s ChipScope and Altera’s SignalTap
features, which provide visibility into the state of the circuit. However, not only do these features provide limited visibility into the circuit state, but changing the signals being viewed
requires multiple time-consuming runs through the vendor’s place and route tools. Configuration readback is another method for providing circuit visibility, but it also has limitations
in viewing the state of the circuit. Lastly, no method currently exists at all for completely
configuring the circuit to a known state.
A design-level scan methodology was proposed to provide complete observability
and controllability for functional verification of FPGA-based designs. This degree of observability and controllability comes at a high cost, however; on average, it roughly doubles
the size of circuits and reduces their clock rates by I CD. When a designer has a circuit that
needs validating, these costs may be justified if the designer can take advantage of fast
hardware execution rather than being forced to use software simulation to validate the design, thus reducing the overall “time-to-market” for the design. In addition, design-level
scan costs are temporary since the scan logic can be removed for the final “production”
design. This suggests that the development and debugging environment might benefit from
a larger FPGA, while the final production design may fit on a smaller FPGA [15]. The
main caveat to this approach, however, is ensuring that the larger FPGA has the same pinout as the smaller FPGA.
Clearly, the costs of using FPGA logic rather than device-level transistors to improve design verification are large. Chapter 5 proposed a few methods to supplement existing visibility and controllability features, which reduces these costs, but they still fall far short of matching the costs for VLSI. The best approach for providing complete observability
and controllability of user circuits, then, is to modify the FPGA architectures themselves.
Chapter 6 showed how using LUTs for scan consumes an order of magnitude more silicon
area than a FF does. Thus, vendor-supplied instrumentation would provide much lower
overheads than those seen in these experiments. In addition to flip-flops, vendor-supplied
instrumentation should address embedded memories as well, and it should support both reading and writing of the user design's state.
In the meantime, there are several possible extensions to this work in design-level
scan for functional verification of FPGA-based designs. An obvious one is to explore other
device-level instrumentation mechanisms—the proposed scan chain is just one possibility.
Second, although this work described how to instrument some of the more common design
elements for scan, techniques must be developed which can integrate other FPGA design
primitives into scan chains. These include I/O blocks and fully-synchronous, single-ported
embedded RAMs. A third possibility is to explore scanning I/O blocks and other design-level primitives through JTAG pins. Fourth, this scan methodology should be extended to
work with designs that have multiple clocks or gated clocks—common features in today’s
designs. For gated clocks, one possible solution is to OR the gated clock logic with the
ScanEnable to force the clock to be enabled during scan. A solution for multiple clocks
might require multiple scan chains that share the same control logic and I/O pins. Fifth,
the system-level issues described in Section 6.1.1 need to be implemented and automated
for actual FPGA-based systems. Finally, more work could be done to research partial scan
techniques to improve functional verification at a lower cost.
Bibliography
[1] T. W. Williams and K. P. Parker, “Design for testability - a survey”, IEEE Transactions on Computers, vol. C-31, no. 1, pp. 2–15, January 1982.
[2] J. M. Arnold, “The Splash 2 software environment”, in Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, D. A. Buell and K. L. Pocek, Eds.,
Napa, CA, Apr. 1993, pp. 88–93.
[3] J. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati, and P. Boucard, “Programmable active memories: Reconfigurable systems come of age”, IEEE Transactions on VLSI Systems, vol. 4, no. 1, pp. 56–69, 1996.
[4] P. Graham, B. Hutchings, and B. Nelson, “Improving the FPGA design process through
determining and applying logical-to-physical design mappings”, Technical Report
CCL-2000-GHN-1, Brigham Young University, Provo, UT, April 2000.
[5] B. L. Hutchings and B. E. Nelson, “Unifying simulation and execution in a design
environment for FPGA systems”, IEEE Transactions on VLSI Systems, to appear.
[6] P. Bellows and B. L. Hutchings, “JHDL - an HDL for reconfigurable systems”, in
Proceedings of IEEE Workshop on FPGAs for Custom Computing Machines, J. M.
Arnold and K. L. Pocek, Eds., Napa, CA, Apr. 1998, pp. 175–184.
[7] B. Hutchings, P. Bellows, J. Hawkins, S. Hemmert, B. Nelson, and M. Rytting, “A
CAD suite for high-performance FPGA design”, in Proceedings of the IEEE Workshop
on FPGAs for Custom Computing Machines, K. L. Pocek and J. M. Arnold, Eds.,
Napa, CA, April 1999, IEEE Computer Society, p. n/a, IEEE.
[8] A. L. Crouch, Design for Test for Digital IC’s and Embedded Core Systems, chapter 3,
p. 97, Prentice Hall PTR, Upper Saddle River, NJ, 1999.
[9] S. L. Hurst, VLSI Testing: Digital and Mixed Analogue/Digital Techniques, chapter 5, p. 218, Number 9 in IEE Circuits, Devices and Systems Series. Institution of
Electrical Engineers, London, 1998.
[10] M. J. S. Smith, Application Specific Integrated Circuits, chapter 14, p. 764, Addison-Wesley, Reading, Mass., 1997.
[11] S. Scalera, M. Falco, and B. Nelson, “A reconfigurable computing architecture for microsensors”, in Proceedings of the IEEE Symposium on Field-Programmable Custom
Computing Machines, Kenneth L. Pocek and Jeffery M. Arnold, Eds., Napa, April
2000, IEEE Computer Society, p. TBA, IEEE Computer Society Press.
[12] M. Wirthlin, S. Morrison, P. Graham, and B. Bray, “Improving the performance
and efficiency of an adaptive amplification operation using configurable hardware”,
in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing
Machines, Kenneth L. Pocek and Jeffery M. Arnold, Eds., Napa, April 2000, IEEE
Computer Society, p. TBA, IEEE Computer Society Press.
[13] André DeHon, Reconfigurable Architectures for General-Purpose Computing, PhD
thesis, Massachusetts Institute of Technology, September 1996.
[14] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs, chapter Appendix B, p. 216, The Kluwer International Series in Engineering
and Computer Science. Kluwer Academic Publishers, Boston, 1999.
[15] S. Trimberger, “A reprogrammable gate array and applications”, in Proceedings of
the IEEE, July 1993, vol. 81, pp. 1030–1041.