
White Paper
No Room for Error: Creating Highly Reliable, High-Availability FPGA Designs
April 2012

Author: Angela Sutton, Staff Product Marketing Manager, Synopsys, Inc.
It comes as no surprise that the designers of FPGAs for military and aerospace applications are interested
in increasing the reliability and availability of their designs. This is, of course, particularly true in the case
of mission-critical and safety-critical electronic systems.
But the need for high-reliability and high-availability electronic systems has expanded beyond traditional
military and aerospace applications. Today, this growing list includes communications infrastructure
systems, medical intensive care and life-support systems (such as heart-lung machines, mechanical
ventilation machines, infusion pumps, radiation therapy machines, robotic surgery machines), nuclear
reactor and other power station control systems, transportation signaling and control systems,
amusement ride control systems, and the list goes on.
How can designers maintain high standards and ensure success for these types of demanding designs?
The answers are here. In this paper we will review the definitions of key concepts: mission critical, safety
critical, high reliability and high availability. We will then consider the various elements associated with the
creation of high-reliability and high-availability FPGA designs.
Key Concepts
Mission-Critical: A mission-critical design refers to those portions of a system that are absolutely
necessary. The concept originates from NASA, where mission-critical elements were those items that had
to work or a billion-dollar space mission would be lost. Mission-critical systems must be able to handle
peak loads, scale on demand and always maintain sufficient functionality to complete the mission.
Safety-Critical: A safety-critical or life-critical system is one whose failure or malfunction may result in
death or serious injury to people, loss of or severe damage to equipment or damage to the environment.
The main objective of safety-critical design is to prevent the system from responding to a fault with wrong
conclusions or wrong outputs. If a fault is severe enough to cause a system failure, then the system must
fail “gracefully,” without generating bad data or inappropriate outputs. For many safety-critical systems,
such as medical infusion pumps and cancer irradiation systems, the safe state upon detection of a failure
is to immediately stop and turn the system off. A safety-critical system is one that has been designed to
lose less than one life per billion hours of operation.
High-Reliability: In the context of an electronic system, the term “reliability” refers to the ability of a system
or component to perform its required function(s) under stated conditions for a specified period of time. This is
often defined as a probability. A high-reliability system is one that will remain functional for a longer period of
time, even in adverse conditions. Some reliability regimes for mission-critical and safety-critical systems are
as follows:
· Fail-Operational systems continue to operate when their control systems fail; for example, electronically controlled car doors that can be unlocked even if the locking control mechanism fails.
· Fail-Safe systems automatically become safe when they can no longer operate. Many medical systems fall into this category, such as x-ray machines, which will switch off when an error is detected.
· Fail-Secure systems maintain maximum security when they can no longer operate; while fail-safe electronic doors unlock during power failures, their fail-secure counterparts would lock. For example, a bank's safe will automatically go into lockdown when the power goes out.
· Fail-Passive systems continue to operate in the event of a system failure. In the case of a failure in an aircraft's autopilot, for example, the aircraft should remain in a state that can be controlled by the pilot.
· Fault-Tolerant systems avoid service failure when faults are introduced into the system. The normal method to tolerate faults is to continually self-test the parts of a system and to switch in duplicate redundant backup circuitry, called hot spares, for failing subsystems.
High-Availability: Users want their electronic systems to be ready to serve them at all times. The term
“availability” refers to the ability of the user community to access the system; if a user cannot access the
system it is said to be “unavailable.” The term “downtime” is used to refer to periods when a system is
unavailable for use. Availability is usually expressed as a percentage of uptime over some specified duration.
Table 1 reflects the translation from a given availability percentage to the corresponding amount of time a
system would be unavailable per week, month or year.
Availability (%)          Downtime per week    Downtime per month*    Downtime per year
90%      ("one nine")     16.8 hours           72 hours               36.5 days
99%      ("two nines")    1.68 hours           7.2 hours              3.65 days
99.9%    ("three nines")  10.1 minutes         43.2 minutes           8.76 hours
99.99%   ("four nines")   1.01 minutes         4.32 minutes           52.56 minutes
99.999%  ("five nines")   6.05 seconds         25.9 seconds           5.256 minutes
99.9999% ("six nines")    0.605 seconds        2.59 seconds           31.5 seconds

*A 30-day month is assumed for monthly calculations.
Table 1. Availability (as a percentage) versus downtime
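As a cross-check, the downtime figures in Table 1 follow directly from the availability percentage. The short sketch below (plain C++, using the same 30-day month and 365-day year assumptions as the table) reproduces the numbers.

```cpp
#include <cstdio>

int main() {
    // Seconds in each reporting period; a 30-day month and 365-day year are
    // assumed, matching the footnote to Table 1.
    const double week  = 7.0   * 24.0 * 3600.0;
    const double month = 30.0  * 24.0 * 3600.0;
    const double year  = 365.0 * 24.0 * 3600.0;

    const double availability[] = {90.0, 99.0, 99.9, 99.99, 99.999, 99.9999};
    for (double a : availability) {
        double down = 1.0 - a / 100.0;   // fraction of time the system is unavailable
        std::printf("%9.4f%%  %10.1f s/week  %10.1f s/month  %12.1f s/year\n",
                    a, down * week, down * month, down * year);
    }
    return 0;
}
```

For example, "five nines" (99.999%) leaves a downtime fraction of 0.00001, which over a 604,800-second week is the 6.05 seconds shown in the table.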
Key Elements of an FPGA Design and Verification Flow
In this section we will briefly consider the various elements associated with an FPGA design specification,
creation and verification flow in the context of creating high-reliability and high-availability designs. These
elements are depicted in Figure 1 and we will explore them in more detail throughout the course of this paper,
with particular emphasis on designs intended for mission-critical and safety-critical applications.
[Figure 1 (block diagram): a design and verification flow running from requirements specification and engineering/architectural specification, through virtual prototyping, algorithmic exploration, high-level synthesis, design (RTL) capture, IP selection and state machines, to simulation, synthesis/optimization, gate-level verification and formal verification, all framed by methodologies, processes and standards, low-power design, distributed design, and traceability, repeatability and design management.]
Figure 1: Elements of an FPGA design and verification flow
Methodologies, Processes and Standards
A key element in creating high-reliability and high-availability designs is to adopt standards such as the ISO
9001 quality management standard. Also, it is vital to define internal methodologies and processes that meet
DO-254 (and other safety-critical) certification needs. The DO-254 standard was originally intended to provide
a way to deliver safe and reliable designs for airborne systems. This standard was subsequently adopted by
the creators of a variety of high-reliability and high-availability electronic systems.
In Europe, industrial automation equipment manufacturers are required to develop their safety-critical designs
according to the ISO 13849 and IEC 62061 standards. Both of these standards are based upon the generic
IEC 61508 standard, which defines requirements for the development of safety products using FPGAs.
In order to meet these standards, designers of safety-critical systems must validate the software, every
component and all of the development tools used in the design.
Requirements Specification
The first step in the process of developing a new design is to capture the requirements for that design. This may be thought of as the "what" (what we want) rather than the "how" (how we are going to achieve this). At the time of writing, a requirements specification is typically captured and presented only in a human-readable form such as a written document. In some cases, this document is created by an external body in the form of a request for proposal (RFP).
In conventional design environments, the requirements specification is largely divorced from the remainder
of the process. This can lead to problems such as the final product not fully addressing all of the requirements.
In the case of high-reliability and high-availability designs, it is necessary to provide some mechanism for
the requirements to be captured in a machine-readable form—perhaps as line items in a database—and for
downstream specification and implementation details to be tied back to their associated requirements.
This helps to ensure that each requirement has been fully addressed and that no requirement “falls through
the cracks.”
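The exact schema for machine-readable requirements is tool- and project-specific; as a purely hypothetical illustration, the sketch below (plain C++, with invented names and fields) models requirements as line items and reports any requirement that no downstream design artifact claims to cover.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical, minimal traceability model: requirements as line items,
// with each design artifact recording which requirement IDs it covers.
struct Requirement {
    std::string id;    // e.g. "REQ-042"
    std::string text;  // human-readable statement of the requirement
};

struct DesignArtifact {
    std::string name;                 // e.g. "fifo_arbiter.vhd"
    std::vector<std::string> covers;  // requirement IDs addressed by this artifact
};

// Report any requirement that no artifact claims to cover.
void report_uncovered(const std::vector<Requirement>& reqs,
                      const std::vector<DesignArtifact>& artifacts) {
    std::map<std::string, bool> covered;
    for (const auto& r : reqs) covered[r.id] = false;
    for (const auto& a : artifacts)
        for (const auto& id : a.covers) covered[id] = true;
    for (const auto& r : reqs)
        if (!covered[r.id])
            std::cout << "Uncovered requirement: " << r.id << " (" << r.text << ")\n";
}

int main() {
    std::vector<Requirement> reqs = {{"REQ-001", "Channel A data integrity"},
                                     {"REQ-002", "Fail-safe shutdown on fault"}};
    std::vector<DesignArtifact> artifacts = {{"channel_a.vhd", {"REQ-001"}}};
    report_uncovered(reqs, artifacts);   // flags REQ-002 as uncovered
    return 0;
}
```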
Engineering and Architectural Specification
The next step in the process is to define the architecture of the system along with the detailed engineering specification for the design. This step includes decisions on how to partition the system into its hardware and software components. It also includes specifying the desired failure modes (fail-operational, fail-safe, fail-secure, fail-passive) and considering any special test logic that may be required to detect and diagnose failures when the system has been deployed in the field.
In some cases it may involve defining the architecture of the system in such a way as to avoid a single point
of failure. If a system requires two data channels, for example, implementing both channels in a single FPGA
makes that FPGA a single point of failure for both channels. By comparison, splitting the functionality across
multiple FPGAs means that at least one channel will remain alive if a single device fails.
The creation and capture of the engineering and architectural specification is the result of expert designers
and system architects making educated guesses. The process typically involves using whiteboards and
spreadsheets and may be assisted by the use of transaction-level system simulation, which is described in the
Architecture Exploration and Performance Analysis section below.
Today, the engineering and architectural specification is typically captured and presented only in a human
readable form such as Word® documents and Excel® spreadsheets. In conventional design environments,
this specification is not necessarily directly tied to the original requirements specification or the downstream
implementation. In the case of high-reliability and high-availability designs, it is necessary to provide some
mechanism for the engineering and architectural specification to be captured in a machine-readable form
such that it can be tied to the original upstream requirements and also to the downstream implementation.
Architecture Exploration and Performance Analysis
There is currently tremendous growth in the development of systems that involve multiple processors and multiple hardware accelerators operating in closely coupled or networked topologies. In addition to tiered memory structures and multi-layer bus structures, these systems, which may be executing hundreds of millions to tens of billions of instructions per second, feature extremely complex software components, and the software content is currently increasing almost exponentially.
One aid to the development of the most appropriate system architecture is to use a transaction-level
simulation model, or virtual prototype, of the system to explore, analyze and optimize the behavior and
performance of the proposed hardware architecture. To enable this, available models of the global
interconnect and shared memory subsystem are typically combined with traffic generators that represent
the performance workload of each application subsystem. Simulation and collection of analysis data enables
users to estimate performance before software is available and optimize architecture and algorithmic
parameters for best results. Hardware-software performance validation can follow by replacing the traffic
generators with processor models running the actual system software.
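Commercial virtual prototyping environments supply their own transaction-level modeling APIs; the sketch below is not tied to any of them. It is a minimal, illustrative C++ traffic generator (hypothetical names and parameters) that issues read/write transactions at a configurable average rate, which is the role such generators play before processor models running real software are swapped in.

```cpp
#include <cstdint>
#include <cstdio>
#include <random>

// Illustrative transaction-level traffic generator (not any specific tool's API):
// issues read/write transactions at a configurable average rate so that
// interconnect and memory-subsystem models can be exercised before software exists.
struct Transaction {
    bool     write;
    uint64_t address;
    uint32_t bytes;
};

class TrafficGenerator {
public:
    TrafficGenerator(double transactions_per_cycle, uint64_t base, uint64_t span)
        : rate_(transactions_per_cycle), base_(base), span_(span), rng_(0xC0FFEE) {}

    // Returns true and fills 'tx' if a transaction is issued this cycle.
    bool tick(Transaction& tx) {
        std::uniform_real_distribution<double> uniform(0.0, 1.0);
        if (uniform(rng_) > rate_) return false;            // no transaction this cycle
        std::uniform_int_distribution<uint64_t> addr(0, span_ - 1);
        tx.write   = (uniform(rng_) < 0.5);
        tx.address = base_ + (addr(rng_) & ~0x3FULL);        // 64-byte aligned bursts
        tx.bytes   = 64;
        return true;
    }

private:
    double          rate_;
    uint64_t        base_, span_;
    std::mt19937_64 rng_;
};

int main() {
    // Hypothetical subsystem generating ~0.25 transactions per cycle into a 16 MB window.
    TrafficGenerator video_in(0.25, 0x80000000ULL, 0x01000000ULL);
    Transaction tx;
    unsigned issued = 0;
    for (unsigned cycle = 0; cycle < 1000; ++cycle)
        if (video_in.tick(tx)) ++issued;
    std::printf("Issued %u transactions in 1000 cycles\n", issued);
    return 0;
}
```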
Accurate measurement based on transaction traffic and software workloads that model real-world system behavior (performance, power consumption, etc.) allows system architects to ensure that the resulting design is reliable and meets the performance goals of the architecture specification without overdesign. It allows the architects to make informed decisions and hardware/software tradeoffs early in the design process, so that changes and recommendations can be made early, reducing project risk.
Once an optimal architecture has been determined, the transaction-level performance model of the system
can become a “golden reference model” against which the hardware design teams later verify the actual
functionality of the hardware portions of the design.
Distributed Design
The creation of complex FPGAs may involve multiple system architects, system engineers, hardware design
engineers, software developers and verification engineers. These engineers could be split into multiple teams,
which may span multiple companies and/or may be geographically dispersed around the world. Aside from
anything else, considerations about how different portions of the design are to be partitioned across different
teams may influence the engineering and architectural specification.
A key consideration is that the entire design and verification environment should be architected so as to
facilitate highly distributed design, parallel design creation and verification, all while allowing requirements and
modifications to be tracked and traced. This means, for example, ensuring that no one can modify an interface without all affected people being informed that such a change has taken place, and recording the fact that a change has been made, who made it and why. Part of this includes the ability to relate implementation decisions and details to specific items in the engineering and architectural specification. Also required is the ability to track progress and report the ongoing status of the project.
Distributed design also requires very sophisticated configuration management, including the ability to take
snapshots of all portions of the design (that is, the current state of all of the hardware and software files
associated with the design), hierarchical design methodologies, along with support for revisions, versions and
archiving of the design and the entire environment used to create the design. This allows the process by which
the design was created to become fully repeatable.
Algorithmic Exploration
For design blocks that perform digital signal processing (DSP), it may be necessary to explore a variety of algorithmic approaches to determine the optimal solution that satisfies the performance and power consumption requirements defined by the overall engineering and architectural specification.
In this case, it is common to capture these portions of the design at a very high level of abstraction. This can be done using model-based design concepts or by creating plain functional C/C++/SystemC models. These high-level representations are also used to explore
the effects of fixed-point quantization. The design environment should allow testbenches that are created to
verify any high-level representations of the design to also be used throughout the remainder of the flow. This
ensures that the RTL created during algorithmic exploration fully matches its algorithmic counterpart.
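As a small illustration of the kind of fixed-point quantization study mentioned above (a plain C++ sketch with made-up coefficient values, not tied to any particular modeling library), a set of floating-point coefficients can be quantized to several candidate word lengths and the worst-case error compared.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Quantize a value to signed fixed point with 'frac_bits' fractional bits.
double quantize(double x, int frac_bits) {
    const double scale = std::ldexp(1.0, frac_bits);  // 2^frac_bits
    return std::round(x * scale) / scale;
}

int main() {
    // Example floating-point filter coefficients (illustrative values only).
    std::vector<double> coeffs = {0.0625, -0.28, 0.61, -0.28, 0.0625};

    for (int frac_bits : {8, 12, 16}) {
        double worst = 0.0;
        for (double c : coeffs)
            worst = std::max(worst, std::fabs(c - quantize(c, frac_bits)));
        std::printf("frac bits = %2d  worst quantization error = %.3e\n",
                    frac_bits, worst);
    }
    return 0;
}
```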
High-Level Synthesis (HLS)
As was discussed in the previous topic, some portions of the design may commence with representations created at a high level of abstraction. These representations are initially used to validate and fine-tune the desired behavior of the design. The next step is to select the optimal micro-architectures for these portions of the design and then progress these micro-architectures into actual implementations. Until recently, the transition from an original high-level representation to the corresponding micro-architecture and implementation was performed by hand, which
was time-consuming and prone to error. Also, due to tight development schedules, designers rarely had the
luxury of experimenting with alternative micro-architecture and implementation scenarios. Instead, it was
common to opt for a micro-architecture and implementation that were guaranteed to work, even if the results
were less-than-optimal in terms of power consumption, performance and silicon area.
High-Level Synthesis (HLS) refers to the ability to take the original high-level representation and to
automatically synthesize it into an equivalent RTL implementation, thereby eliminating human-induced errors
associated with manual translation. The use of HLS also allows system architects and designers to experiment
with a variety of alternative implementation scenarios so as to select the optimal implementation for a
particular application. Furthermore, HLS allows the same original representation to be re-targeted to different
implementations for different deployments.
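The input to HLS is typically a C/C++ function written so that the tool can infer a datapath. Tool-specific pragmas and directives are deliberately omitted from the illustrative sketch below; it simply shows the kind of fixed-bound loop structure (here a hypothetical 5-tap FIR filter) that high-level synthesis tools can map to hardware.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative HLS-style C++: a 5-tap FIR filter with fixed loop bounds and a
// static delay line (state that would become registers). Tool-specific
// pragmas/directives are omitted; this is a sketch, not any vendor's template.
constexpr int TAPS = 5;

int32_t fir5(int32_t sample, const int16_t coeff[TAPS]) {
    static int32_t delay_line[TAPS] = {0};

    // Shift in the new sample.
    for (int i = TAPS - 1; i > 0; --i)
        delay_line[i] = delay_line[i - 1];
    delay_line[0] = sample;

    // Multiply-accumulate across the taps.
    int64_t acc = 0;
    for (int i = 0; i < TAPS; ++i)
        acc += static_cast<int64_t>(delay_line[i]) * coeff[i];

    return static_cast<int32_t>(acc >> 15);   // scale back from Q15 coefficients
}

int main() {
    const int16_t coeff[TAPS] = {2048, -9175, 19988, -9175, 2048};  // example Q15 taps
    for (int n = 0; n < TAPS; ++n) {
        int32_t y = fir5((n == 0) ? (1 << 15) : 0, coeff);  // impulse response
        std::printf("y[%d] = %d\n", n, y);                  // reproduces the taps
    }
    return 0;
}
```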
Selection and Verification of Intellectual Property
Today’s high-end FPGA designs can contain the equivalent of hundreds of thousands or even millions of logic gates. Creating each new design from the ground up would be extremely resource-intensive, time-consuming and error-prone. Thus, in order to manage this complexity, around 75% of a modern design may consist of intellectual property (IP) blocks. Some of these blocks may be internally generated from previous designs; others may come from third-party vendors. In fact, it is not unusual for an FPGA design to include third-party IP blocks from multiple vendors.
In some cases the IP may be delivered as human-readable RTL; in other cases it may be encrypted or obfuscated. Sometimes the IP vendor may deliver two different models: one at a high level of abstraction for use with software and one at the gate level for implementation into the design.
To create high-reliability and high-availability FPGA designs, the design environment must allow selection and
integration of these IP blocks. Also, the IP blocks should be testable and be delivered with testbenches. Even
if the IP is encrypted or obfuscated, there should be visibility into key internal registers to facilitate verification
and debug in the context of the entire design.
State Machines
FPGA designs often include the use of one or more
state machines. In fact, as opposed to a single large
state machine, it is common to employ a large number
of smaller machines that interact with each other, often
in extremely complicated ways.
In order to create high-reliability and high-availability
FPGA designs, it is necessary to create the control
logic associated with these multiple state machines
in such a way as to ensure that they don’t “step on
each other’s toes.” For example, it would be easy to create two state machines, each of which can write data
into the same first-in, first-out memory (FIFO). When the testbench for this portion of the design is created, its designer will ensure that both of the state machines can indeed write into the FIFO. However, the testbench designer may neglect to test the case in which both state machines attempt to access the FIFO simultaneously.
This type of scenario can become exceedingly complicated even when only a few state machines are involved, and it can become overwhelmingly complex as the number of state machines increases. To address this problem,
special tools and techniques are available to ensure that whenever there is the potential for such a problem
to occur, the design engineer is informed and is also required to make a decision. In the case of multiple state
machines writing to the same FIFO, for example, the designer may decide to specify a priority order (“State
machine A has priority over state machine B, which in turn has priority over state machine C,” and so forth).
Alternatively, the designer may decide to use a "round robin" approach in which each of the state machines takes its turn. The key point is that the control logic for the state machines should be designed from the ground up in such a way that the machines cannot interfere with each other in an undefined manner.
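Such arbitration schemes can be prototyped behaviorally before being committed to RTL. The C++ sketch below (hypothetical names, illustration only) models a round-robin arbiter that grants at most one of the requesting state machines write access to the shared FIFO each cycle.

```cpp
#include <array>
#include <cstdio>

// Behavioral sketch of a round-robin arbiter: of the state machines requesting
// write access to a shared FIFO this cycle, exactly one is granted, and the
// search for the next grant starts just after the most recently granted requester.
constexpr int WRITERS = 3;

class RoundRobinArbiter {
public:
    // Returns the index of the granted requester, or -1 if nobody requested.
    int grant(const std::array<bool, WRITERS>& request) {
        for (int offset = 1; offset <= WRITERS; ++offset) {
            int candidate = (last_ + offset) % WRITERS;
            if (request[candidate]) {
                last_ = candidate;
                return candidate;
            }
        }
        return -1;
    }

private:
    int last_ = WRITERS - 1;   // so the first grant starts the search at index 0
};

int main() {
    RoundRobinArbiter arb;
    std::array<bool, WRITERS> req = {true, true, false};   // A and B both request
    for (int cycle = 0; cycle < 4; ++cycle)
        std::printf("cycle %d: grant to state machine %d\n", cycle, arb.grant(req));
    return 0;
}
```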
Another consideration with state machines is how to design them in such a way that they cannot power up into an undefined or illegal state, and that nothing can cause them to transition into such a state. Once again, there are tools and techniques that can aid designers in creating high-reliability and
high-availability state machines of this nature. That said, irrespective of the quality of the design, radiation
events can potentially cause a state machine to enter an undefined or illegal state. In order to address this,
additional logic must be included in the design to detect and mitigate such an occurrence. This topic is
explored in more detail in the Creating Radiation-Tolerant FPGA Designs section later in this paper.
RTL Synthesis and Optimization
To facilitate verification and debug, all aspects of the design must be traceable to ensure that the implementation correctly reflects the intended design functionality. During the process of synthesizing an RTL representation into its gate-level equivalent, for example, it is necessary to keep track of the relationship between designer-specified signal names in the RTL and automatically generated signal names in the gate-level representation, both to ease the task of instrumenting the design (further described in the Verification and Debug section below) and to support cross-probing between the gate and RTL levels. This
means that even when working with the physical device operating on the board, signal values are automatically
presented to the users in the context of the RTL source code with which they are most familiar, dramatically
increasing the ease and speed of debugging.
Today’s logic and physical RTL synthesis and optimization tools are incredibly powerful and sophisticated.
Countless hours have been devoted to developing algorithms that result in optimal designs that use the lowest
possible power, consume the smallest possible amount of FPGA resources (which translates as “silicon area”
in ASIC terms), and extract the maximum level of performance out of the device.
However, in order to create high-reliability and high-availability FPGA designs, it may NOT be desirable for
the synthesis tool to perform all of the optimizations of which it is capable. For example, it may be desirable to
be able to preserve certain nodes all the way through the design process; that is, to identify specific nodes in
the RTL representation and to maintain these nodes in the gate-level representation and also in the physical
device following the mapping of the logic into the FPGA’s look-up tables (LUTs).
Furthermore, it would be undesirable for the synthesis tool to inadvertently remove any logic that it regarded
as being unnecessary, but that the designers had specifically included in the design to support downstream
verification, debug and test. Similarly, in the case of radiation-tolerant designs that employ triple modular redundancy (TMR), in which logic is triplicated and voting circuits are used to select the majority view from the three copies, it would be unfortunate, to say the least, if the synthesis tool determined that this redundant logic was unnecessary and decided to remove it.
The end result is that users of the synthesis technology must be able to control the tool and to instruct it about which portions of the design can be rigorously optimized and which portions serve a debug or redundant-circuitry purpose and must therefore be preserved unchanged. Furthermore, it must be possible to tie these decisions back to specific elements in the engineering and architectural specification, which are themselves associated with specific items in the original requirements specification.
Verification and Debug
There are many aspects of verification and debug that affect the creation of high-reliability and high-availability FPGA designs. For example, it is necessary to be able to perform formal equivalence checking between the various representations of the design, such as the RTL and gate-level descriptions, to ensure that any transformations performed by synthesis and optimization have not impacted the desired functionality of the design.
Another consideration is that the design environment should allow any testbenches that were created to verify
the high-level representations during the architecture exploration and algorithmic exploration portions of
the design flow to be reused throughout the remainder of the flow. This ensures that the RTL and gate-level
implementations fully match their algorithmic counterparts.
One very important consideration is the ability to instrument the RTL with special debug logic in the form
of virtual logic analyzers. This allows the designer to specify which signals internal to the device are to be
monitored along with any trigger conditions that will turn the monitoring on and off. These logic analyzers will
subsequently be synthesized into the design and loaded into the physical FPGA. In addition to the fact that
this technology should be quick and easy to use, the environment must keep track of the relationship between
designer-specified signal names in the RTL and automatically-generated signal names in the gate-level
representation. This means that even when working with the physical device, signal values are automatically
presented to the users in the context of the RTL source code with which they are most familiar, which
dramatically increases the ease and speed of debugging.
Low-Power Design
Over the past few years, power consumption has moved to the forefront of FPGA design and verification
concerns. Power consumption has a direct impact on various aspects of the design, including its cost and
reliability. For example, consider a multi-FPGA design that consumes so much power that it is necessary to employ a fan for cooling purposes. In addition to increasing the cost of the system (and the fact that the fan itself consumes more power), the use of the fan impacts the reliability and availability of the system, because a fan failure, which is a very common occurrence, can cause the system to overheat and shut down.
In the not-so-distant past, power considerations were relegated to the later stages of the FPGA development
flow. By comparison, in the case of today’s extremely complex FPGA designs, “low power” isn’t just
something that can be simply “bolted” on at the end of the development process. System architects and
design engineers need to be able to estimate power early on and to measure power later on because
the consequences of running too hot may necessitate time-consuming design re-spins. In order to meet
aggressive design schedules, it is no longer sufficient to consider power only in the implementation phase of
the design. The size and complexity of today’s FPGAs makes it imperative to consider power throughout the
entire development process, from the engineering and architectural specification phase, through the virtual
prototyping and algorithmic evaluation portions of the flow, all the way to implementation with power-aware
synthesis and optimization.
Creating Radiation-Tolerant FPGA Designs
It is well known that the designers of equipment intended for deployment in hostile environments, such as nuclear power stations and aerospace applications, have to expend time and effort to ensure that the electronic components chosen are physically resistant to the effects of radiation; that is, radiation-hardened (rad-hard). In addition to the rad-hard components themselves, it is also necessary to create the designs to be radiation tolerant (rad-tolerant), which means that the designs are created in such a way as to mitigate the effects of any radiation events. Such rad-tolerant designs may contain, for example, built-in error-correcting memory architectures and built-in redundant circuit elements.
In reality, radiation from one source or another is all around us all the time. In addition to cosmic rays that are
raining down on us from above, radioactive elements are found in the ground we walk on, the air we breathe
and the food we eat. Even the materials used to create the packages for electronic components such as
silicon chips can spontaneously emit radioactive particles. This was not a significant problem until recently,
because the structures created in the silicon were relatively large and were not typically affected by the types
and strengths of radioactive sources found close to the Earth’s surface.
However, in our efforts to increase silicon capacity, increase performance, reduce power consumption and
lower costs—each new generation of integrated circuit features smaller and smaller transistors. Work has
already commenced on rolling out devices at the 28-nm node, with the 22-/20-nm node not far behind. These
structures are so small that they can be affected by the levels of radiation found on Earth.
Radiation-induced errors can result in a telecom router shutting down, a control system failing to respond to
a command or an implantable medical device incorrectly interpreting a patient’s condition and responding
inappropriately. These are just a few examples of many high-reliability or mission-critical systems that require
designers to understand and account for radiation-induced effects.
A radiation event may flip the state of a sequential element in the design such as a register or a memory cell—
this is known as a single-event upset (SEU). Alternatively, a radiation event may cause an unwanted transient
in the combinatorial logic—this is referred to as a single-event transient (SET). If an SET is clocked into a
register or stored in a memory element, then it becomes an SEU.
Insertion of error detection and mitigation strategies is key to the alleviation of SEUs. Some techniques are
listed in Table 2.
Error Detection                                        Error Mitigation
TMR: triplicate logic and compare outputs,             Create mitigation logic to mask the fault:
then report any mismatch                               · Distributed TMR: triplicate submodules prone to
                                                         SEUs/SETs and vote on the outputs
                                                       · Fault-tolerant FSMs using Hamming-3 encoding
                                                         for immunity against single-bit errors
                                                       · ECC RAMs (with TMR) for single-bit error
                                                         detection and correction in memories

Safe FSM and safe sequential circuitry: create and     Periodically scrub the device; reprogram the
preserve the custom error-detection circuitry          device on the fly
specified in your RTL

Table 2: SEU error detection and mitigation approaches
In order to create radiation-tolerant, high-reliability and high-availability FPGA designs, design tools need to be able to take the original RTL specified by the designers and automatically replicate parts of the circuit, for example, to implement TMR.
Distributed TMR automatically inserts redundancy into the design by triplicating all or part of the logic in a circuit and then adding "majority voting" logic to select the best two out of three results in case a signal is changed due to an SEU. TMR is, by its very nature, expensive in resources, so it is usual to apply TMR to just those parts of the design that the designer considers to be the most critical parts of the circuit. The synthesis tools can typically help you to specify where you want redundancy, and the tool will then automatically apply it during synthesis. TMR may be required at the register level, the individual memory level, the block level or at the entire system level.
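The voting step itself is straightforward. As a behavioral illustration only (the synthesis tool inserts the actual voters), a bitwise two-out-of-three majority can be expressed as follows.

```cpp
#include <cstdint>
#include <cstdio>

// Bitwise 2-of-3 majority vote: for every bit position, the output takes the
// value that at least two of the three replicated copies agree on.
uint32_t majority_vote(uint32_t a, uint32_t b, uint32_t c) {
    return (a & b) | (a & c) | (b & c);
}

int main() {
    uint32_t golden = 0xDEADBEEF;
    uint32_t upset  = golden ^ (1u << 7);   // one copy suffers a single-bit SEU
    uint32_t voted  = majority_vote(golden, upset, golden);
    std::printf("voted value matches golden: %s\n",
                voted == golden ? "yes" : "no");
    return 0;
}
```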
In the case of state machines, it is no longer sufficient to just create a design that cannot clock the state machine into an illegal state. Today, that state machine could be forced into an illegal state by a radiation event that flips a state register. Thus, the design tools must be capable of taking the original state machine representation defined by the designer and augmenting it with the ability to detect and mitigate radiation-induced errors.
Safe FSM and safe sequential circuitry implementations involve using error-detection circuitry to force
a state machine or sequential logic into a reset state or into a user-defined error state so the error can be
handled in a custom manner as specified by the user in their RTL. The user can, for example, specify the
mitigation circuitry as an RTL “others” clause. The synthesis software will then automatically implement
this circuitry so that, should an error occur during operation of the design, the FSM or sequential logic will return operation to a safe state, such as a reset or default state.
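In VHDL this corresponds to the "others" branch of the state-case statement (or the "default" branch in Verilog). The C++ sketch below is only a behavioral illustration of the same idea: any state value outside the defined set is caught by the default branch and forces a return to a known safe state.

```cpp
#include <cstdio>

// Behavioral illustration of a "safe" FSM: any state value outside the defined
// set (for example, one reached through a radiation-induced bit flip) is caught
// by the default branch and forces a return to a known safe state. In RTL the
// same effect is described with the VHDL "others" / Verilog "default" branch.
enum State : unsigned { RESET = 0, IDLE = 1, RUN = 2, DONE = 3 };

State next_state(unsigned current, bool start, bool finished) {
    switch (current) {
        case RESET: return IDLE;
        case IDLE:  return start ? RUN : IDLE;
        case RUN:   return finished ? DONE : RUN;
        case DONE:  return IDLE;
        default:    return RESET;   // undefined/illegal state: recover safely
    }
}

int main() {
    unsigned corrupted = RUN ^ 0x4;   // a bit flip lands outside the legal state set
    std::printf("recovered to state %u\n", next_state(corrupted, false, false));
    return 0;
}
```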
Fault-tolerant FSMs with Hamming-3 encoding, for example, can be used to detect and correct single-bit errors: with a Hamming distance of 3 between state codes, a state register that is erroneously flipped into an adjacent value is detected and corrected, and the FSM continues to operate automatically. Prior to synthesis, the designer need only tell the synthesis tool that they wish to use a Hamming-3 encoding strategy for designated FSMs. The synthesis tool will automatically create all of the circuitry for error detection and mitigation, and the design will automatically continue to run in the event of an error.
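To make the Hamming-3 idea concrete: if every pair of state codes differs in at least three bit positions, a single flipped bit leaves the register closer to its original code than to any other legal code, so the original state can be restored. The sketch below (plain C++, with an encoding invented purely for the example) checks the distance property and corrects a single-bit upset by snapping to the nearest legal code.

```cpp
#include <bitset>
#include <cstdio>
#include <vector>

// Count differing bit positions between two state encodings.
int hamming_distance(unsigned a, unsigned b) {
    return static_cast<int>(std::bitset<32>(a ^ b).count());
}

int main() {
    // Illustrative state encoding with pairwise Hamming distance >= 3.
    std::vector<unsigned> codes = {0b000000, 0b000111, 0b111001, 0b111110};

    // Verify the distance property that single-bit correction relies on.
    for (size_t i = 0; i < codes.size(); ++i)
        for (size_t j = i + 1; j < codes.size(); ++j)
            if (hamming_distance(codes[i], codes[j]) < 3)
                std::printf("encoding violates Hamming-3 between %zu and %zu\n", i, j);

    // A single-bit upset is corrected by snapping to the nearest legal code.
    unsigned upset = codes[2] ^ (1u << 3);
    unsigned best  = codes[0];
    for (unsigned c : codes)
        if (hamming_distance(c, upset) < hamming_distance(best, upset)) best = c;
    std::printf("corrected 0x%02X back to 0x%02X\n", upset, best);
    return 0;
}
```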
Error-correcting code (ECC) memories may be used to detect and correct single-bit errors. ECC memories combined with TMR prevent false data from being captured by the memory and from being propagated to the parts of the circuitry that the RAM output controls. Once you specify, in the RTL or constraints file, which memory functions are safety critical for your design, the synthesis software knows to automatically infer the ECC memories offered by many FPGA vendors, automatically makes the proper circuit connections and, if requested, deploys additional TMR.
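Real FPGA ECC memories typically use wider SECDED codes, but the principle can be illustrated with the classic Hamming(7,4) code: three parity bits protect four data bits, and the syndrome computed on readback points directly at any single flipped bit. The sketch below is a behavioral C++ illustration only, not any vendor's implementation.

```cpp
#include <cstdint>
#include <cstdio>

// Hamming(7,4) single-error-correcting code, the building block behind ECC RAMs.
// Bit layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4.
uint8_t encode(uint8_t data) {                 // data: 4 bits (d1..d4 = bits 0..3)
    uint8_t d1 = (data >> 0) & 1, d2 = (data >> 1) & 1;
    uint8_t d3 = (data >> 2) & 1, d4 = (data >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;                 // covers positions 1,3,5,7
    uint8_t p2 = d1 ^ d3 ^ d4;                 // covers positions 2,3,6,7
    uint8_t p3 = d2 ^ d3 ^ d4;                 // covers positions 4,5,6,7
    return (p1 << 0) | (p2 << 1) | (d1 << 2) | (p3 << 3) |
           (d2 << 4) | (d3 << 5) | (d4 << 6);  // bit i-1 holds position i
}

uint8_t decode(uint8_t code) {                 // corrects any single flipped bit
    auto bit = [code](int pos) { return (code >> (pos - 1)) & 1; };
    int s1 = bit(1) ^ bit(3) ^ bit(5) ^ bit(7);
    int s2 = bit(2) ^ bit(3) ^ bit(6) ^ bit(7);
    int s3 = bit(4) ^ bit(5) ^ bit(6) ^ bit(7);
    int syndrome = s1 | (s2 << 1) | (s3 << 2); // position of the flipped bit, 0 if none
    if (syndrome) code ^= (1u << (syndrome - 1));
    return ((code >> 2) & 1) | (((code >> 4) & 1) << 1) |
           (((code >> 5) & 1) << 2) | (((code >> 6) & 1) << 3);
}

int main() {
    // Exhaustively flip every bit of every codeword and confirm correction.
    for (uint8_t data = 0; data < 16; ++data)
        for (int flip = 0; flip < 7; ++flip)
            if (decode(encode(data) ^ (1u << flip)) != data)
                std::printf("correction failed for data %u, bit %d\n", data, flip);
    std::printf("all single-bit upsets corrected\n");
    return 0;
}
```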
Furthermore, FPGA-based designs have an additional consideration with regard to their configuration
cells. Thus far, the majority of FPGAs used in high-radiation environments have been based on antifuse
configuration cells. These have the advantage of being immune to radiation events, but they have the
disadvantage of being only one-time programmable. Also, antifuse-based FPGAs are typically one or two
technology nodes behind the highest performance, highest capacity state-of-the-art SRAM-based devices.
While users are aware of the advantages offered by SRAM-based FPGAs, they realize their designs (and design tools) must offer some way to mitigate radiation-induced errors in the configuration cells.
In non-antifuse FPGA technologies, automated TMR, the ability of the software to select ECC memories, and the generation of safe or fault-tolerant FSMs as described above are all ways to alleviate SEUs. Deciding where and which techniques to deploy involves both risk and tradeoffs between cost and performance. Ultimately, during synthesis, it is important for the software to allow the user to select and control the specific error detection and mitigation strategies to use, and where in the design to deploy each of them.
Software Considerations
The task of creating high-reliability and high-availability FPGA designs involves all aspects of the system,
including both the hardware and software components. Software has become an increasingly critical part of
nearly all present day systems. As with hardware, creating high-reliability, high-availability software depends
on good requirements, design and implementation. In turn, this relies heavily on a disciplined software
engineering process that will anticipate and design against unintended consequences.
Traceability, Repeatability and Design Management
The concepts of traceability, repeatability and design management permeate the entire development flow
when it comes to creating high-reliability and high-availability FPGA designs. Right from the origination
of a new development project, it is necessary to build project plans, to track project deliverables against
milestones and to constantly monitor the status of the project to ensure that the schedule will be met
successfully.
As has been noted throughout this paper, this requires some way to capture the original requirements in a
machine readable form and to associate individual elements in the engineering and architectural specification
with corresponding items in the requirements specification. Similarly, as the design proceeds through
architecture exploration, algorithmic evaluation, high-level synthesis, RTL capture, logic and physically
aware synthesis, every aspect of the implementation should be associated with corresponding items in the
engineering and architectural specification.
The development environment also needs to support design and configuration management, including the
ability to take snapshots of a distributed design (that is, the current state of all of the hardware and software
files associated with the design), along with support for revisions and versions and archiving. This is important
for every design, especially those involving hardware, software and verification engineers that are split into
multiple teams, which may span multiple companies and/or may be geographically dispersed around the world.
Summary
For FPGAs designed at the 28-nm node and below, high reliability and high availability of the resulting systems
are of great concern for a wide variety of target application areas.
Fortunately, techniques are now available within Synopsys EDA tools to automate aspects of developing both
mission-critical and safety-critical FPGA-based systems. These tools and techniques span engineering and
architectural specification and exploration, the ability to incorporate pre-verified IP within your design and
techniques to trace, track and document project requirements every step of the way to ensure compliance
with industry practices and standards such as DO-254.
Using Synopsys tools, engineers can now create radiation-tolerant FPGA designs by incorporating deliberate
redundancy within their design and by developing safe state machines with custom error mitigation logic that
returns the design to a known safe state of operation, should an error occur due to radiation effects. This logic
can ensure high system availability in the field and provide reliable system operation.
Synopsys tools also enable you to verify reliable and correct operation of your design by allowing you to
create an implementation and then monitor, probe and debug its operation on the board to ensure correct
system behavior. Specifically, you can probe, monitor and debug your design operation from the RTL level
while running the design on the board. During the design creation process, design engineers may additionally
choose to use Synopsys formal verification equivalence checking, virtual prototyping and software simulation
to validate functional correctness and to ensure that performance and power needs are being met.
For more details on solutions that help you develop highly reliable, high-availability designs, please
contact Synopsys and visit http://www.synopsys.com/FPGA
Synopsys, Inc.  700 East Middlefield Road  Mountain View, CA 94043  www.synopsys.com
©2012 Synopsys, Inc. All rights reserved. Synopsys is a trademark of Synopsys, Inc. in the United States and other countries. A list of Synopsys trademarks is
available at http://www.synopsys.com/copyright.html. All other names mentioned herein are trademarks or registered trademarks of their respective owners.
04/12.RP.CS1598.