White Paper
No Room for Error: Creating Highly Reliable, High-Availability FPGA Designs
April 2012
Author: Angela Sutton, Staff Product Marketing Manager, Synopsys, Inc.

It comes as no surprise that the designers of FPGAs for military and aerospace applications are interested in increasing the reliability and availability of their designs. This is, of course, particularly true in the case of mission-critical and safety-critical electronic systems. But the need for high-reliability and high-availability electronic systems has expanded beyond traditional military and aerospace applications. Today, this growing list includes communications infrastructure systems, medical intensive care and life-support systems (such as heart-lung machines, mechanical ventilation machines, infusion pumps, radiation therapy machines and robotic surgery machines), nuclear reactor and other power station control systems, transportation signaling and control systems, amusement ride control systems, and the list goes on.

How can designers maintain high standards and ensure success for these types of demanding designs? The answers are here. In this paper we will review the definitions of key concepts: mission critical, safety critical, high reliability and high availability. We will then consider the various elements associated with the creation of high-reliability and high-availability FPGA designs.

Key Concepts

Mission-Critical: A mission-critical design refers to those portions of a system that are absolutely necessary. The concept originates from NASA, where mission-critical elements were those items that had to work or a billion-dollar space mission would be lost. Mission-critical systems must be able to handle peak loads, scale on demand and always maintain sufficient functionality to complete the mission.

Safety-Critical: A safety-critical or life-critical system is one whose failure or malfunction may result in death or serious injury to people, loss of or severe damage to equipment, or damage to the environment. The main objective of safety-critical design is to prevent the system from responding to a fault with wrong conclusions or wrong outputs. If a fault is severe enough to cause a system failure, then the system must fail “gracefully,” without generating bad data or inappropriate outputs. For many safety-critical systems, such as medical infusion pumps and cancer irradiation systems, the safe state upon detection of a failure is to immediately stop and turn the system off. A safety-critical system is one that has been designed to lose less than one life per billion hours of operation.

High-Reliability: In the context of an electronic system, the term “reliability” refers to the ability of a system or component to perform its required function(s) under stated conditions for a specified period of time. This is often defined as a probability. A high-reliability system is one that will remain functional for a longer period of time, even in adverse conditions. Some reliability regimes for mission-critical and safety-critical systems are as follows:

· Fail-Operational systems continue to operate when their control systems fail; for example, electronically controlled car doors that can be unlocked even if the locking control mechanism fails
· Fail-Safe systems automatically become safe when they can no longer operate. Many medical systems fall into this category, such as x-ray machines, which will switch off when an error is detected

· Fail-Secure systems maintain maximum security when they can no longer operate. While fail-safe electronic doors unlock during power failures, their fail-secure counterparts would lock; a bank’s safe, for example, will automatically go into lockdown when the power goes out

· Fail-Passive systems continue to operate in the event of a system failure. In the case of a failure in an aircraft’s autopilot, for example, the aircraft should remain in a state that can be controlled by the pilot

· Fault-Tolerant systems avoid service failure when faults are introduced into the system. The usual method of tolerating faults is to continually self-test the parts of a system and to switch in duplicate redundant backup circuitry, called hot spares, for failing subsystems

High-Availability: Users want their electronic systems to be ready to serve them at all times. The term “availability” refers to the ability of the user community to access the system; if a user cannot access the system, it is said to be “unavailable.” The term “downtime” refers to periods when a system is unavailable for use. Availability is usually expressed as a percentage of uptime over some specified duration. Table 1 shows the translation from a given availability percentage to the corresponding amount of time a system would be unavailable per week, month or year.

Availability              Downtime per week   Downtime per month*   Downtime per year
90% (“one nine”)          16.8 hours          72 hours              36.5 days
99% (“two nines”)         1.68 hours          7.2 hours             3.65 days
99.9% (“three nines”)     10.1 minutes        43.2 minutes          8.76 hours
99.99% (“four nines”)     1.01 minutes        4.32 minutes          52.56 minutes
99.999% (“five nines”)    6.05 seconds        25.9 seconds          5.256 minutes
99.9999% (“six nines”)    0.605 seconds       2.59 seconds          31.5 seconds

*A 30-day month is assumed for monthly calculations.
Table 1. Availability (as a percentage) versus downtime
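For reference, each downtime figure in Table 1 is simply the unavailable fraction of time (one minus the availability) multiplied by the length of the period in question. The short sketch below is illustrative only and is not part of any tool flow; it assumes the same 168-hour week, 30-day month and 365-day year used in the table.

```cpp
// Sketch: converting an availability percentage into expected downtime.
// Assumes a 168-hour week, a 30-day month (per Table 1's footnote) and a 365-day year.
#include <cstdio>

int main() {
    const double hours_per_week  = 7.0 * 24.0;
    const double hours_per_month = 30.0 * 24.0;
    const double hours_per_year  = 365.0 * 24.0;

    const double availabilities[] = {90.0, 99.0, 99.9, 99.99, 99.999, 99.9999};

    for (double a : availabilities) {
        double unavailable = 1.0 - a / 100.0;   // fraction of time the system is down
        std::printf("%.4f%% available: %10.4f h/week %10.4f h/month %10.4f h/year\n",
                    a,
                    unavailable * hours_per_week,
                    unavailable * hours_per_month,
                    unavailable * hours_per_year);
    }
    return 0;
}
```

For example, “five nines” corresponds to an unavailable fraction of 0.00001 of 8,760 hours, or roughly 5.26 minutes of downtime per year, matching the table.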
Key Elements of an FPGA Design and Verification Flow

In this section we will briefly consider the various elements associated with an FPGA design specification, creation and verification flow in the context of creating high-reliability and high-availability designs. These elements are depicted in Figure 1 and we will explore them in more detail throughout the course of this paper, with particular emphasis on designs intended for mission-critical and safety-critical applications.

[Figure 1: Elements of an FPGA design and verification flow — requirements specification, engineering/architectural specification, virtual prototyping, algorithmic exploration, high-level synthesis, design (RTL) capture, IP selection, simulation, synthesis/optimization, gate-level and formal verification, and state machine design, supported by methodologies for low-power design, distributed design, and traceability, repeatability and design management]

Methodologies, Processes and Standards

A key element in creating high-reliability and high-availability designs is to adopt standards such as the ISO 9001 quality management standard. It is also vital to define internal methodologies and processes that meet DO-254 (and other safety-critical) certification needs. The DO-254 standard was originally intended to provide a way to deliver safe and reliable designs for airborne systems; it has subsequently been adopted by the creators of a variety of other high-reliability and high-availability electronic systems.

In Europe, industrial automation equipment manufacturers are required to develop their safety-critical designs according to the ISO 13849 and IEC 62061 standards. Both of these standards are based upon the generic IEC 61508 standard, which defines requirements for the development of safety products using FPGAs. In order to meet these standards, designers of safety-critical systems must validate the software, every component and all of the development tools used in the design.

Requirements Specification

The first step in the process of developing a new design is to capture the requirements for that design. This may be thought of as the “what” (what we want) rather than the “how” (how we are going to achieve it). At the time of writing, a requirements specification is typically captured and presented only in a human-readable form such as a written document. In some cases, this document is created by an external body in the form of a request for proposal (RFP).

In conventional design environments, the requirements specification is largely divorced from the remainder of the process. This can lead to problems such as the final product not fully addressing all of the requirements. In the case of high-reliability and high-availability designs, it is necessary to provide some mechanism for the requirements to be captured in a machine-readable form—perhaps as line items in a database—and for downstream specification and implementation details to be tied back to their associated requirements. This helps to ensure that each requirement has been fully addressed and that no requirement “falls through the cracks.”

Engineering and Architectural Specification

The next step in the process is to define the architecture of the system along with the detailed engineering specification for the design. This step includes decisions on how to partition the system into its hardware and software components. It also includes specifying the desired failure modes (fail-operational, fail-safe, fail-secure, fail-passive) and considering any special test logic that may be required to detect and diagnose failures once the system has been deployed in the field.

In some cases it may involve defining the architecture of the system in such a way as to avoid a single point of failure. If a system requires two data channels, for example, implementing both channels in a single FPGA makes that FPGA a single point of failure for both channels. By comparison, splitting the functionality across multiple FPGAs means that at least one channel will remain alive if a single device fails.

The creation and capture of the engineering and architectural specification is the result of expert designers and system architects making educated guesses. The process typically involves using whiteboards and spreadsheets and may be assisted by the use of transaction-level system simulation, which is described in the Architecture Exploration and Performance Analysis section below. Today, the engineering and architectural specification is typically captured and presented only in a human-readable form such as Word® documents and Excel® spreadsheets. In conventional design environments, this specification is not necessarily directly tied to the original requirements specification or the downstream implementation.
In the case of high-reliability and high-availability designs, it is necessary to provide some mechanism for the engineering and architectural specification to be captured in a machine-readable form such that it can be tied to the original upstream requirements and also to the downstream implementation.

Architecture Exploration and Performance Analysis

There is currently tremendous growth in the development of systems that involve multiple processors and multiple hardware accelerators operating in closely coupled or networked topologies. In addition to tiered memory structures and multilayer bus structures, these systems, which may be executing hundreds of millions to tens of billions of instructions per second, feature extremely complex software components, and the software content is increasing almost exponentially.

One aid to the development of the most appropriate system architecture is to use a transaction-level simulation model, or virtual prototype, of the system to explore, analyze and optimize the behavior and performance of the proposed hardware architecture. To enable this, available models of the global interconnect and shared memory subsystem are typically combined with traffic generators that represent the performance workload of each application subsystem. Simulation and collection of analysis data enable users to estimate performance before software is available and to optimize architecture and algorithmic parameters for best results. Hardware-software performance validation can follow by replacing the traffic generators with processor models running the actual system software.

Accurate measurement based on transaction traffic and software workloads that model real-world system behavior (performance, power consumption, etc.) allows system architects to ensure that the resulting design is reliable and meets the performance goals of the architecture specification without overdesign. It also allows architects to make informed hardware/software tradeoffs early in the design process, when changes and recommendations are still inexpensive to act upon, reducing project risk. Once an optimal architecture has been determined, the transaction-level performance model of the system can become a “golden reference model” against which the hardware design teams later verify the actual functionality of the hardware portions of the design.
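To make the idea of traffic generators and transaction-level performance models concrete, the following deliberately simplified sketch in plain C++ stands in for a real virtual prototyping environment; it is not Synopsys tooling, and the two traffic sources, their rates and the service time are all assumptions. Two generators issue transactions to a shared interconnect modeled as a single first-come, first-served resource, and the model reports average latency and utilization.

```cpp
// Sketch: a toy transaction-level performance model.
// Two hypothetical traffic generators ("cpu" and "dma") issue requests to a shared
// interconnect modeled as a single FIFO-served resource with a fixed service time.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

struct Transaction { double arrival; int source; };

int main() {
    std::mt19937 rng(42);
    const double mean_gap[2] = {40.0, 120.0};   // mean inter-arrival times (ns): the workload
    const double service_time = 20.0;           // ns per transaction on the interconnect
    const int per_source = 10000;

    std::vector<Transaction> txns;
    for (int src = 0; src < 2; ++src) {
        std::exponential_distribution<double> gap(1.0 / mean_gap[src]);
        double t = 0.0;
        for (int i = 0; i < per_source; ++i) {
            t += gap(rng);
            txns.push_back({t, src});
        }
    }
    std::sort(txns.begin(), txns.end(),
              [](const Transaction& a, const Transaction& b) { return a.arrival < b.arrival; });

    double busy_until = 0.0, busy_time = 0.0, total_latency = 0.0, last_done = 0.0;
    for (const Transaction& tx : txns) {
        double start = std::max(tx.arrival, busy_until);  // wait if the interconnect is busy
        double done  = start + service_time;
        total_latency += done - tx.arrival;
        busy_time     += service_time;
        busy_until     = done;
        last_done      = done;
    }
    std::printf("average latency: %.1f ns, interconnect utilization: %.1f%%\n",
                total_latency / txns.size(), 100.0 * busy_time / last_done);
    return 0;
}
```

Real virtual prototypes model far more detail (arbitration, bursts, memory hierarchies), but even a model this small shows how workload parameters can be swept to expose bottlenecks before RTL exists.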
Distributed Design

The creation of complex FPGAs may involve multiple system architects, system engineers, hardware design engineers, software developers and verification engineers. These engineers could be split into multiple teams, which may span multiple companies and/or be geographically dispersed around the world. Aside from anything else, considerations about how different portions of the design are to be partitioned across different teams may influence the engineering and architectural specification.

A key consideration is that the entire design and verification environment should be architected so as to facilitate highly distributed design and parallel design creation and verification, all while allowing requirements and modifications to be tracked and traced. This means, for example, ensuring that no one can modify an interface without all relevant/impacted people being informed that such a change has taken place, and recording the fact that a change has been made, who made it and why. Part of this includes the ability to relate implementation decisions and details to specific items in the engineering and architectural specification. Also required is the ability to track progress and report the ongoing status of the project.

Distributed design also requires very sophisticated configuration management, including the ability to take snapshots of all portions of the design (that is, the current state of all of the hardware and software files associated with the design), hierarchical design methodologies, and support for revisions, versions and archiving of both the design and the entire environment used to create it. This makes the process by which the design was created fully repeatable.

Algorithmic Exploration

With regard to design blocks that perform digital signal processing (DSP), it may be necessary to explore a variety of algorithmic approaches to determine the optimal solution that satisfies the performance and power consumption requirements defined by the overall engineering and architectural specification. In this case, it is common to capture these portions of the design at a very high level of abstraction. This can be done using model-based design concepts or by creating plain functional C/C++/SystemC models. These high-level representations are also used to explore the effects of fixed-point quantization. The design environment should allow testbenches that are created to verify any high-level representations of the design to also be used throughout the remainder of the flow. This ensures that the RTL created during algorithmic exploration fully matches its algorithmic counterpart.

High-Level Synthesis (HLS)

As was discussed in the previous topic, some portions of the design may commence with representations created at a high level of abstraction. These representations are initially used to validate and fine-tune the desired behavior of the design. The next step is to select the optimal micro-architectures for these portions of the design and then to progress these micro-architectures into actual implementations.

Until recently, the transition from an original high-level representation to the corresponding micro-architecture and implementation was performed by hand, which was time-consuming and prone to error. Also, due to tight development schedules, designers rarely had the luxury of experimenting with alternative micro-architecture and implementation scenarios. Instead, it was common to opt for a micro-architecture and implementation that were guaranteed to work, even if the results were less than optimal in terms of power consumption, performance and silicon area.

High-Level Synthesis (HLS) refers to the ability to take the original high-level representation and automatically synthesize it into an equivalent RTL implementation, thereby eliminating human-induced errors associated with manual translation. The use of HLS also allows system architects and designers to experiment with a variety of alternative implementation scenarios so as to select the optimal implementation for a particular application. Furthermore, HLS allows the same original representation to be re-targeted to different implementations for different deployments.
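As an illustration of the kind of plain C++ model used for algorithmic exploration, the hypothetical fragment below implements a small FIR filter twice, once in floating point and once with assumed Q1.15 fixed-point word lengths, so that the effect of quantization can be measured before committing to an implementation. The coefficients, word lengths and stimulus are invented for the example.

```cpp
// Sketch: floating-point reference model versus a fixed-point version of a
// hypothetical 4-tap FIR filter, used to study quantization effects.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Floating-point reference model.
std::vector<double> fir_float(const std::vector<double>& x, const double (&h)[4]) {
    std::vector<double> y(x.size(), 0.0);
    for (size_t n = 0; n < x.size(); ++n)
        for (size_t k = 0; k < 4 && k <= n; ++k)
            y[n] += h[k] * x[n - k];
    return y;
}

// Fixed-point model: Q1.15 coefficients and samples, 32-bit accumulator.
std::vector<double> fir_fixed(const std::vector<double>& x, const double (&h)[4]) {
    const double scale = 32768.0;                                  // 2^15
    int16_t hq[4];
    for (int k = 0; k < 4; ++k) hq[k] = static_cast<int16_t>(std::lround(h[k] * scale));
    std::vector<double> y(x.size(), 0.0);
    for (size_t n = 0; n < x.size(); ++n) {
        int32_t acc = 0;
        for (size_t k = 0; k < 4 && k <= n; ++k) {
            int16_t xq = static_cast<int16_t>(std::lround(x[n - k] * scale));
            acc += static_cast<int32_t>(hq[k]) * xq;               // Q2.30 product
        }
        y[n] = acc / (scale * scale);                              // back to a real value
    }
    return y;
}

int main() {
    const double h[4] = {0.25, 0.5, 0.25, 0.125};                  // hypothetical coefficients
    std::vector<double> x;
    for (int n = 0; n < 256; ++n) x.push_back(0.9 * std::sin(0.05 * n));

    std::vector<double> yf = fir_float(x, h), yq = fir_fixed(x, h);
    double max_err = 0.0;
    for (size_t n = 0; n < x.size(); ++n)
        max_err = std::max(max_err, std::fabs(yf[n] - yq[n]));
    std::printf("maximum quantization error: %g\n", max_err);
    return 0;
}
```

A testbench written around such a model (sweeping word lengths until the error budget is met) can then be reused, as noted above, to check the RTL produced later in the flow.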
Selection and Verification of Intellectual Property

Today’s high-end FPGA designs can contain the equivalent of hundreds of thousands or even millions of logic gates. Creating each new design from the ground up would be extremely resource-intensive, time-consuming and error-prone. Thus, in order to manage this complexity, around 75% of a modern design may consist of intellectual property (IP) blocks. Some of these blocks may be internally generated from previous designs; others may come from third-party vendors. In fact, it is not unusual for an FPGA design to include third-party IP blocks from multiple vendors. In some cases the IP may be delivered as human-readable RTL; in other cases it may be encrypted or obfuscated. Sometimes the IP vendor may deliver two different models—one at a high level of abstraction for use with software and one at the gate level for implementation into the design.

To create high-reliability and high-availability FPGA designs, the design environment must allow selection and integration of these IP blocks. Also, the IP blocks should be testable and be delivered with testbenches. Even if the IP is encrypted or obfuscated, there should be visibility into key internal registers to facilitate verification and debug in the context of the entire design.

State Machines

FPGA designs often include the use of one or more state machines. In fact, as opposed to a single large state machine, it is common to employ a large number of smaller machines that interact with each other, often in extremely complicated ways.

In order to create high-reliability and high-availability FPGA designs, it is necessary to create the control logic associated with these multiple state machines in such a way as to ensure that they don’t “step on each other’s toes.” For example, it would be easy to create two state machines, each of which can write data into the same first-in, first-out memory (FIFO). When this portion of the testbench is created, its designer will ensure that both of the state machines can indeed write into the FIFO. However, the testbench designer may neglect to test the case in which both state machines attempt to simultaneously access the FIFO. This type of scenario can become exceedingly complicated even when only a few state machines are involved, and it can become overwhelmingly complex as the number of state machines increases.

To address this problem, special tools and techniques are available to ensure that whenever there is the potential for such a problem to occur, the design engineer is informed and is also required to make a decision. In the case of multiple state machines writing to the same FIFO, for example, the designer may decide to specify a priority order (“State machine A has priority over state machine B, which in turn has priority over state machine C,” and so forth). Alternatively, the designer may decide to use a “round robin” approach in which each of the state machines takes its turn (a simple model of such an arbiter is sketched below). The key point is that the control logic for the state machines should be designed from the ground up in such a way that the machines cannot interfere with each other in an undefined manner.

Another consideration with state machines is how to design them in such a way that they cannot power up into an undefined or illegal state, and that nothing can cause them to transition into an undefined or illegal state during operation. Once again, there are tools and techniques that can aid designers in creating high-reliability and high-availability state machines of this nature.
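As a concrete illustration of the round-robin arbitration described above, the following hypothetical C++ model (behavioral only, not synthesizable RTL, with invented requester names) rotates the grant among requesters that want to write to a shared FIFO so that no state machine can monopolize it.

```cpp
// Sketch: a behavioral model of round-robin arbitration between state machines
// that share one FIFO. Illustrative only; the real policy lives in the control logic.
#include <cstdio>
#include <deque>
#include <string>
#include <vector>

struct Requester { std::string name; bool wants_to_write; int payload; };

int main() {
    std::deque<int> fifo;                       // the shared FIFO
    std::vector<Requester> req = {
        {"fsm_a", true, 10}, {"fsm_b", true, 20}, {"fsm_c", false, 30}};
    size_t next = 0;                            // round-robin pointer

    for (int cycle = 0; cycle < 4; ++cycle) {
        // Starting from 'next', grant the first requester that wants access.
        for (size_t i = 0; i < req.size(); ++i) {
            size_t idx = (next + i) % req.size();
            if (req[idx].wants_to_write) {
                fifo.push_back(req[idx].payload);
                std::printf("cycle %d: granted %s\n", cycle, req[idx].name.c_str());
                next = (idx + 1) % req.size();  // rotate priority so no requester starves
                break;
            }
        }
    }
    std::printf("FIFO now holds %zu entries\n", fifo.size());
    return 0;
}
```

In the real design the same policy would be implemented in the arbitration logic itself, and the corresponding corner cases (simultaneous requests, back-to-back grants) added to the testbench.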
That said, irrespective of the quality of the design, radiation events can potentially cause a state machine to enter an undefined or illegal state. In order to address this, additional logic must be included in the design to detect and mitigate such an occurrence. This topic is explored in more detail in the Creating Radiation-Tolerant FPGA Designs section later in this paper.

RTL Synthesis and Optimization

To facilitate verification and debug, all aspects of the design must be traceable to ensure that the implementation correctly reflects the intended design functionality. During the process of synthesizing an RTL representation into its gate-level equivalent, for example, it is necessary to keep track of the relationship between designer-specified signal names in the RTL and automatically generated signal names in the gate-level representation. This eases the task of instrumenting the design (further described in the Verification and Debug section below) and supports cross-probing between the gate and RTL levels. This means that even when working with the physical device operating on the board, signal values are automatically presented to the users in the context of the RTL source code with which they are most familiar, dramatically increasing the ease and speed of debugging.

Today’s logic and physical RTL synthesis and optimization tools are incredibly powerful and sophisticated. Countless hours have been devoted to developing algorithms that result in optimal designs that use the lowest possible power, consume the smallest possible amount of FPGA resources (which translates to “silicon area” in ASIC terms) and extract the maximum level of performance out of the device.

However, in order to create high-reliability and high-availability FPGA designs, it may not be desirable for the synthesis tool to perform all of the optimizations of which it is capable. For example, it may be desirable to preserve certain nodes all the way through the design process; that is, to identify specific nodes in the RTL representation and to maintain these nodes in the gate-level representation and also in the physical device following the mapping of the logic into the FPGA’s look-up tables (LUTs). Furthermore, it would be undesirable for the synthesis tool to inadvertently remove any logic that it regarded as being unnecessary, but that the designers had specifically included in the design to support downstream verification, debug and test. Similarly, in the case of radiation-tolerant designs that employ triple modular redundancy (TMR), in which logic is triplicated and voting circuits are used to select the majority view from the three copies, it would be unfortunate, to say the least, if the synthesis tool determined that this redundant logic was unnecessary and decided to remove it.

The end result is that users of the synthesis technology must be able to control the tool and instruct it as to which portions of the design can be rigorously optimized and which portions serve a debug or redundant-circuitry purpose and must therefore be preserved unchanged. Furthermore, it must be possible to tie these decisions back to specific elements in the engineering and architectural specification, which are themselves associated with specific items in the original requirements specification.
Verification and Debug

There are many aspects to verification and debug that affect the creation of high-reliability and high-availability FPGA designs. For example, it is necessary to be able to perform formal equivalence checking between the various representations of the design, such as the RTL and gate-level descriptions, to ensure that any transformations performed by synthesis and optimization have not impacted the desired functionality of the design. Another consideration is that the design environment should allow any testbenches that were created to verify the high-level representations during the architecture exploration and algorithmic exploration portions of the design flow to be reused throughout the remainder of the flow. This ensures that the RTL and gate-level implementations fully match their algorithmic counterparts.

One very important consideration is the ability to instrument the RTL with special debug logic in the form of virtual logic analyzers. This allows the designer to specify which signals internal to the device are to be monitored, along with any trigger conditions that will turn the monitoring on and off. These logic analyzers will subsequently be synthesized into the design and loaded into the physical FPGA. In addition to the fact that this technology should be quick and easy to use, the environment must keep track of the relationship between designer-specified signal names in the RTL and automatically generated signal names in the gate-level representation. This means that even when working with the physical device, signal values are automatically presented to the users in the context of the RTL source code with which they are most familiar, which dramatically increases the ease and speed of debugging.

Low-Power Design

Over the past few years, power consumption has moved to the forefront of FPGA design and verification concerns. Power consumption has a direct impact on various aspects of the design, including its cost and reliability. For example, consider a multi-FPGA design that consumes so much power that it is necessary to employ a fan for cooling purposes. In addition to increasing the cost of the system (and the fact that the fan itself consumes power), the use of a fan impacts the reliability and availability of the system. This is because a failure of the fan, which is a very common occurrence, can cause the system to overheat and fail or shut down.

In the not-so-distant past, power considerations were relegated to the later stages of the FPGA development flow. By comparison, in the case of today’s extremely complex FPGA designs, “low power” isn’t something that can simply be “bolted on” at the end of the development process. System architects and design engineers need to be able to estimate power early on and to measure power later on, because the consequences of running too hot may necessitate time-consuming design re-spins. In order to meet aggressive design schedules, it is no longer sufficient to consider power only in the implementation phase of the design. The size and complexity of today’s FPGAs make it imperative to consider power throughout the entire development process, from the engineering and architectural specification phase, through the virtual prototyping and algorithmic evaluation portions of the flow, all the way to implementation with power-aware synthesis and optimization.
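As an indication of what “estimating power early” can mean in its simplest form, the sketch below applies the standard first-order relation for dynamic power, P ≈ α·C·V²·f, with entirely assumed numbers; real estimates come from vendor power estimators and, later, from power-aware synthesis and measurement.

```cpp
// Sketch: a first-order dynamic power estimate of the kind an architect might run
// early in the flow. All numbers are illustrative assumptions, not device data.
#include <cstdio>

int main() {
    const double toggle_rate  = 0.15;     // alpha: average fraction of nodes switching per cycle
    const double switched_cap = 50e-9;    // C: total switched capacitance in farads (assumed)
    const double supply_v     = 1.0;      // V: core supply voltage in volts
    const double clock_hz     = 200e6;    // f: clock frequency in hertz
    const double static_power = 0.35;     // leakage in watts (assumed, from a vendor estimate)

    double dynamic_power = toggle_rate * switched_cap * supply_v * supply_v * clock_hz;
    std::printf("dynamic: %.2f W, static: %.2f W, total: %.2f W\n",
                dynamic_power, static_power, dynamic_power + static_power);
    return 0;
}
```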
Creating Radiation-Tolerant FPGA Designs

It is well known that the designers of equipment intended for deployment in hostile environments—such as nuclear power stations and aerospace applications—have to expend time and effort to ensure that the electronic components chosen are physically resistant to the effects of radiation—radiation-hardened (rad-hard). In addition to the rad-hard components themselves, it is also necessary to create the designs to be radiation tolerant (rad-tolerant), which means that the designs are created in such a way as to mitigate the effects of any radiation events. Such rad-tolerant designs may contain, for example, built-in error-correcting memory architectures and built-in redundant circuit elements.

In reality, radiation from one source or another is all around us all the time. In addition to cosmic rays raining down on us from above, radioactive elements are found in the ground we walk on, the air we breathe and the food we eat. Even the materials used to create the packages for electronic components such as silicon chips can spontaneously emit radioactive particles. This was not a significant problem until recently, because the structures created in the silicon were relatively large and were not typically affected by the types and strengths of radioactive sources found close to the Earth’s surface. However, in our efforts to increase silicon capacity, increase performance, reduce power consumption and lower costs—each new generation of integrated circuit features smaller and smaller transistors. Work has already commenced on rolling out devices at the 28-nm node, with the 22-/20-nm node not far behind. These structures are so small that they can be affected by the levels of radiation found on Earth.

Radiation-induced errors can result in a telecom router shutting down, a control system failing to respond to a command or an implantable medical device incorrectly interpreting a patient’s condition and responding inappropriately. These are just a few examples of the many high-reliability or mission-critical systems that require designers to understand and account for radiation-induced effects.

A radiation event may flip the state of a sequential element in the design such as a register or a memory cell—this is known as a single-event upset (SEU). Alternatively, a radiation event may cause an unwanted transient in the combinatorial logic—this is referred to as a single-event transient (SET). If an SET is clocked into a register or stored in a memory element, then it becomes an SEU.

Insertion of error detection and mitigation strategies is key to the alleviation of SEUs. Some techniques are listed in Table 2.

Error detection: TMR: triplicate logic and compare the outputs, then report any mismatch
Error mitigation: Create mitigation logic to mask the fault:
· Distributed TMR: triplicate submodules prone to SEUs/SETs and vote on the outputs
· Fault-tolerant FSMs using Hamming-3 encoding for immunity against single-bit errors
· ECC RAMs (with TMR) for single-bit error detection and correction in memories

Error detection: Safe FSM and safe sequential circuitry: create and preserve the custom error-detection circuitry specified in your RTL
Error mitigation: Periodically scrub the device; reprogram the device on the fly

Table 2: SEU error detection and mitigation approaches

In order to be able to create radiation-tolerant high-reliability and high-availability FPGA designs, design tools need to be able to take the original RTL specified by the designers and automatically replicate parts of the circuit, for example, to implement TMR. Distributed TMR inserts redundancy into the design automatically by triplicating all or part of the logic in a circuit and then adding “majority voting” logic that takes the best two out of three results in case a signal is corrupted by an SEU (a simple behavioral illustration of majority voting is sketched below). TMR is, by its very nature, expensive in terms of resources, so it is usual to apply it only to those parts of the design that the designer considers to be the most critical parts of the circuit. The synthesis tools can typically help you to specify where you want redundancy, and the tool will then automatically apply it during synthesis. TMR may be required at the register level, the individual memory level, the block level or the entire system level.

In the case of state machines, it is no longer sufficient just to create a design that cannot clock the state machine into an illegal state. Today, that state machine could be forced into an illegal state by a radiation event that flips a state register. Thus, the design tools must be capable of taking the original state machine representation defined by the designer and augmenting it with the ability to detect and mitigate radiation-induced errors.

Safe FSM and safe sequential circuitry implementations involve using error-detection circuitry to force a state machine or sequential logic into a reset state or into a user-defined error state so the error can be handled in a custom manner as specified by the user in their RTL. The user can, for example, specify the mitigation circuitry in an RTL “others” clause. The synthesis software will then automatically implement this circuitry so that, should an error occur during operation of the design, the FSM or sequential logic will return to a safe state such as a reset or default state.

Fault-tolerant FSMs with Hamming-3 encoding can be used to detect and correct single-bit errors. With a Hamming distance of 3 between state encodings, a state register that erroneously reaches an adjacent encoding is detected and correct operation of the FSM continues automatically. Prior to synthesis, the designer need only tell the synthesis tool that they wish to use a Hamming-3 encoding strategy for designated FSMs. The synthesis tool will automatically create all the circuitry for error detection and mitigation, and the design will automatically continue to run in the event of an error.

Error-correcting code (ECC) memories may be used to detect and correct single-bit errors. ECC memories combined with TMR prevent false data from being captured by the memory and from being propagated to the parts of the circuitry that the RAM output controls. Once you specify in the RTL or constraints file which memory functions are safety critical for your design, the synthesis software knows to automatically infer the ECC memories offered by many FPGA vendors, automatically makes the proper circuit connections and, if requested, deploys additional TMR.
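To make the voting concrete, the sketch below is a small behavioral C++ illustration of two-out-of-three majority voting across three copies of the same logic; it is not the RTL a synthesis tool would generate, and the toy function and the injected bit flip are assumptions for the example.

```cpp
// Sketch: a behavioral illustration of TMR majority voting. Three copies of the
// same combinational function are evaluated and a bit-wise two-out-of-three vote
// masks a single corrupted copy.
#include <cstdint>
#include <cstdio>

// The protected logic: any pure function of the inputs (here a toy 8-bit add).
static uint8_t logic_copy(uint8_t a, uint8_t b) { return static_cast<uint8_t>(a + b); }

// Bit-wise majority vote across the three replicated outputs.
static uint8_t vote(uint8_t x, uint8_t y, uint8_t z) {
    return static_cast<uint8_t>((x & y) | (y & z) | (x & z));
}

int main() {
    uint8_t a = 0x5A, b = 0x17;

    // Three identical copies of the logic.
    uint8_t r0 = logic_copy(a, b);
    uint8_t r1 = logic_copy(a, b);
    uint8_t r2 = logic_copy(a, b);

    // Simulate an SEU flipping one bit in one copy only.
    r1 ^= 0x08;

    std::printf("copies: 0x%02X 0x%02X 0x%02X  voted: 0x%02X  expected: 0x%02X\n",
                unsigned(r0), unsigned(r1), unsigned(r2),
                unsigned(vote(r0, r1, r2)), unsigned(logic_copy(a, b)));
    return 0;
}
```

Flipping a single bit in one copy leaves the voted result unchanged, which is exactly the property TMR relies on; a mismatch between the copies can additionally be reported for error logging or scrubbing.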
Furthermore, FPGA-based designs have an additional consideration with regard to their configuration cells. Thus far, the majority of FPGAs used in high-radiation environments have been based on antifuse configuration cells. These have the advantage of being immune to radiation events, but they have the disadvantage of being only one-time programmable. Also, antifuse-based FPGAs are typically one or two technology nodes behind the highest-performance, highest-capacity state-of-the-art SRAM-based devices. While users are aware of the advantages offered by SRAM-based FPGAs, they realize that their design (and design tools) must offer some way to mitigate radiation-induced errors in the configuration cells. In non-antifuse FPGA technologies, automated TMR, the ability of the software to select ECC memories and the generation of safe or fault-tolerant FSMs, as described above, are all ways to alleviate SEUs. Deciding where and which techniques to deploy involves weighing risk against cost and performance tradeoffs. Ultimately, during synthesis, it is important for the software to allow the user to select and control the specific error detection and mitigation strategies to use and where in the design to deploy each of them.

Software Considerations

The task of creating high-reliability and high-availability FPGA designs involves all aspects of the system, including both the hardware and software components. Software has become an increasingly critical part of nearly all present-day systems. As with hardware, creating high-reliability, high-availability software depends on good requirements, design and implementation. In turn, this relies heavily on a disciplined software engineering process that will anticipate and design against unintended consequences.

Traceability, Repeatability and Design Management

The concepts of traceability, repeatability and design management permeate the entire development flow when it comes to creating high-reliability and high-availability FPGA designs. Right from the origination of a new development project, it is necessary to build project plans, to track project deliverables against milestones and to constantly monitor the status of the project to ensure that the schedule will be met. As has been noted throughout this paper, this requires some way to capture the original requirements in a machine-readable form and to associate individual elements in the engineering and architectural specification with corresponding items in the requirements specification. Similarly, as the design proceeds through architecture exploration, algorithmic evaluation, high-level synthesis, RTL capture, and logic and physically aware synthesis, every aspect of the implementation should be associated with corresponding items in the engineering and architectural specification.

The development environment also needs to support design and configuration management, including the ability to take snapshots of a distributed design (that is, the current state of all of the hardware and software files associated with the design), along with support for revisions, versions and archiving. This is important for every design, especially those involving hardware, software and verification engineers split into multiple teams, which may span multiple companies and/or be geographically dispersed around the world.
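As a minimal illustration of what machine-readable traceability might look like (hypothetical record layouts, IDs and file names; a production flow would use a requirements-management database rather than hard-coded data), the sketch below links requirements to engineering-specification items and implementation files and flags any requirement with no downstream coverage.

```cpp
// Sketch: a toy traceability check. Every requirement should be traceable to at
// least one specification item that has an implementation artifact behind it.
#include <cstdio>
#include <set>
#include <string>
#include <vector>

struct SpecItem {
    std::string id;                       // engineering/architectural spec item
    std::string covers_requirement;       // upstream requirement it addresses
    std::vector<std::string> impl_files;  // downstream RTL/software implementing it
};

int main() {
    std::set<std::string> requirements = {"REQ-001", "REQ-002", "REQ-003"};

    std::vector<SpecItem> spec = {
        {"SPEC-10", "REQ-001", {"rtl/fifo_arbiter.v"}},
        {"SPEC-11", "REQ-002", {"rtl/safe_fsm.v", "sw/monitor.c"}},
    };

    // Which requirements are actually traced through to an implementation?
    std::set<std::string> covered;
    for (const SpecItem& s : spec)
        if (!s.impl_files.empty()) covered.insert(s.covers_requirement);

    for (const std::string& r : requirements)
        if (!covered.count(r))
            std::printf("WARNING: %s has no spec item or implementation tracing to it\n",
                        r.c_str());
    return 0;
}
```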
Summary

For FPGAs designed at the 28-nm node and below, high reliability and high availability of the resulting systems are of great concern for a wide variety of target application areas. Fortunately, techniques are now available within Synopsys EDA tools to automate aspects of developing both mission-critical and safety-critical FPGA-based systems. These tools and techniques span engineering and architectural specification and exploration, the ability to incorporate pre-verified IP within your design, and techniques to trace, track and document project requirements every step of the way to ensure compliance with industry practices and standards such as DO-254.

Using Synopsys tools, engineers can now create radiation-tolerant FPGA designs by incorporating deliberate redundancy within their design and by developing safe state machines with custom error-mitigation logic that returns the design to a known safe state of operation should an error occur due to radiation effects. This logic can ensure high system availability in the field and provide reliable system operation. Synopsys tools also enable you to verify reliable and correct operation of your design by allowing you to create an implementation and then monitor, probe and debug its operation on the board, working at the RTL level even while the design is running in hardware. During the design creation process, design engineers may additionally choose to use Synopsys formal verification equivalence checking, virtual prototyping and software simulation to validate functional correctness and to ensure that performance and power needs are being met.

For more details on solutions that help you develop highly reliable, high-availability designs, please contact Synopsys and visit http://www.synopsys.com/FPGA

Synopsys, Inc. 700 East Middlefield Road, Mountain View, CA 94043 www.synopsys.com
©2012 Synopsys, Inc. All rights reserved. Synopsys is a trademark of Synopsys, Inc. in the United States and other countries. A list of Synopsys trademarks is available at http://www.synopsys.com/copyright.html. All other names mentioned herein are trademarks or registered trademarks of their respective owners. 04/12.RP.CS1598.