A Simplified Approach to Fault Tolerant State Machine

A Simplified Approach to Fault Tolerant State Machine Design for Single Event Upsets Melanie Berg Principle Design Engineer, Ball Aerospace & Technologies Corp. 1 ABSTRACT As Integrated Circuit (IC) geometries become smaller and core voltages scale down, the probability of incurring system faults increases significantly. Errors occur when charged particles penetrate a memory cell and cross a junction, creating an aberrant charge that changes the state of the bit. Due to the complexity and sensitivity of large designs, fault tolerant schemes are evolving at the gate level. This paper will address Single Event Upsets (SEUs) within edge-triggered D-Flip-Flops (DFFs) and assumes that the upsets are soft (correctable by the following clock edge … if the DFF is enabled). Solely considering single events is a fair assessment due to the low probability of having multiple errors occur within one clock cycle. Due to the radiation effects in space, the Aerospace industry has always had to design with SEU consideration. As far as gate-level DFF protection is concerned, Triple Mode Redundant (TMR – voting) logic is the most commonly used scheme to combat SEUs. However, TMR can be very area extensive and - in a turbulent environment – may not fully erase the probability of upsets. As a solution, many error-coding techniques have been proposed as a compliment (or replacement) to TMR, however due to their complexity, they are rarely implemented. A simplified approach to fault tolerant state machine design starting from architectural development through synthesis will also be discussed. Examples of coding schemes that include additional logic for error detection and (in some cases) correction such as One Hot, Sequential, and Hamming will be examined. Due to the fact that users have run into roadblocks with synthesis tools “optimizing” away necessary logic for error handling, special attention will be given to the synthesis tools concerning the necessary techniques involved in producing the correct realization of functionality. 2 FAULT TOLERANCE The definition of Fault Tolerance is the ability to mask or recover from erroneous conditions in a system once an error has been detected. 2.1 Determining the Level of Fault Tolerance for Your System The degree of fault tolerance implementation is defined by your system level requirements… i.e. specifications that clearly state acceptable behavior upon error. The following are some questions that must be answered within the system requirements documentation:  Does your system only need to detect an error?  How quickly must the system respond to an error?  Must your system also correct the error?  Is the system susceptible to more than one error per clock cycle? 2.2 D Common Approach: TMR (Triple Mode Redundancy) SET CLR D SET Q Q Q Voting Logic CLR D SET CLR Q Q Q Figure 1: TMR Implementation Diagram Triple Mode Redundancy (TMR) is the most commonly implemented solution of SEU tolerance because it is a very simple solution. One needs to be cautious when implementing mitigation. It is important that the voting logic is glitch free. Otherwise, the glitchy TMR voting logic (which can occur due to mitigation across separate clock domains or hazardous combinational logic) can be caught by a clock edge. In low frequency designs, the probability of mitigation glitches being caught by clock edges is low. However, as we increase the frequency, so does the probability. Probability of glitch capture is based on clock speed, fan-out, and the width of the glitch (speed of the gate logic within the voting circuitry). The following figures demonstrate a poor implementation of TMR using a 32-bit counter. A counter was chosen for this example due to the large fanout of the circuitry. sysclk Reset If Outsig glitches near a clock edge, unpredictable results within the counter occur A B C OutSig TMR Circuit E 32 bits Counter For this example, C will be hit by an SEU, the TMR logic should stay stable. However, poor TMR circuitry was synthesized and a glitch occurs on OutSig Figure 2: 32 Bit counter with TMR Figure 3: Waveform Demonstrating the Effects of Glithes in the TMR Logic 3 SYNCHRONOUS DESIGN WITH ASYNCHRONOUS EVENTS It is important to note that SEUs are asynchronous. It is impossible to control when (or where) the event will occur. This is a point that is overwhelmingly overlooked. Because this paper focuses on sequential Single Event Upsets (SEUs) within a synchronous design environment, it is important to discuss the affects of an asynchronous event (in which every internal DFF is susceptible) within a synchronous design environment. If the event occurs well within the clock period, then there may be enough time to deterministically correct the error (depending on the scheme and clock speed). However, if the SEU occurs near a clock edge, Metastability (or unpredictable events) can occur. Current mitigation schemes do not take this into account and thus decreases the anticipated reliability. 4 MOTIVATION FOR STATE MACHINE FAULT TOLERANCE DFFs play an important role within synchronous designs. They are contained within one clock domain and act as deterministic timing boundaries. Such a design strategy (synchronous techniques) increases the verification coverage of a circuit and enhances production. Synchronous state machines use DFFs to hold its current state value. State Machines are generally used as controllers and are at the heart of most designs. If a synchronous state machine is not designed to accommodate Single Event Upsets, the circuit can become locked or produce unpredictable behavior until a reset is generated. Unfortunately, waiting for a system reset may not be a suitable solution. Thus, designers should be aware of techniques for error detection and perhaps correction specifically for erroneous state conditions. It is important to note that current synthesis tools are specifically geared towards area and timing optimization. Their algorithms will want to “erase” redundant schemes for fault tolerance that the designer places within the HDL code. Therefore, the designer must also be familiar with the synthesis package of choice and apply the necessary directives for preserving redundant logic. 5 SYNCHRONOUS STATE MACHINE IMPLEMENTATION Generally, state machines are utilized as controlling mechanisms within a design. They determine when signals should turn on or off, when to implement (or stop implementing) a function, how long to wait for events, etc…. However, state machines are not necessary to implement such functionality. As a matter of fact, using state machines has the tendency to increase the number of total gates within a design. In addition, if operating a state machine in a disruptive environment (such as a space mission), the entire system can lock up into an unreachable condition. So why use state machines? Utilization of state machines afford the designer: an easy to follow design methodology, manageable design reviews, and a hook into a systematic verification process. It also alleviates the propensity of creating “spaghetti “(out of control) designs. 5.1 Structure A synchronous state machine is designed to deterministically transition through a pattern of defined states. A state is represented by a register (set of DFFs) and is referred to as the “current state.” CLOCK INPUTS CURRENT STATE OUTPUTS NEXT STATE Figure 4: Traditional State Machine Logic Diagram The structure consists of four parts: 1. Inputs: All inputs must be synchronous to a “synchronous” state machine. Inputs are used to determine next state transitions. 2. Next state logic: Combinatorial logic used to generate the next stage (state) for the machine. The next state value is a function of the state machine’s inputs and its current state. 3. Current State Register: Register of n-bit DFFs used to hold the current state of the machine. Synchronous state machines change state only by a clock edge. 4. Output logic: The output can be purely combinatorial or registered (generally it is preferable to register the logic). The outputs can be a function of the next state and/or current state (and perhaps the direct inputs) of the machine. 5.2 Encoding Schemes Each state of a state machine must be mapped into some type of encoding (pattern of bits). Once the state has been mapped, it is then considered a defined (legal) state. Depending on the encoding scheme and the number of states within the design, there can be “unused” encoded patterns – i.e. there is no defined mapping from the encoded pattern to a state. In such a case, if there is a fault within the circuitry and the state machine jumps into one of these unmapped (illegal) states, then the circuit has the potential to become locked (or stuck) in some unreachable condition (undefined state). The current most popular encoding schemes used by designers are: One Hot, Binary, and Gray. The reader should be aware that there are many other schemes; however, this paper will only address Binary and One Hot – plus a special encoding technique specifically for correction. Registers: binary encoding Good state : SEND_DATA STATES (5): 1 0 0 IDLE :000 GET_DATA :001 PROCESS_DATA:010 BAD_DATA :011 SEND_DATA :100 1 1 0 Bad state: unmapped Registers: One Hot encoding STATES (5): Good state : SEND_DATA IDLE :00001 GET_DATA :00010 PROCESS_DATA:00100 BAD_DATA :01000 SEND_DATA :10000 1 0 0 0 0 1 1 0 0 0 Bad state: unmapped Figure 5: Binary and One Hot Encoding Schemes - Mapped States vs. Unmapped States 5.2.1 Binary The Binary Encoding technique maps states into a base 2 counting scheme. (Log2(NStates)) registers are required – i.e. for 5 states (NStates), 3 registers (NReg) are required. There will always be 2NReg – Nstates unmapped states. Because 2NReg – Nstates will always be less than NStates (theory of binary of encoding), the probability of flipping into a mapped state is higher than flipping into an unmapped state (IDLE or error state was not including in this deduction). Plainly speaking, it is more likely to jump into “any” of the mapped states than the IDLE state (the word “any” brings the non-deterministic nature to this problem). State(1) Flips upon SEU: Using the “Safe” attribute will transition the user to a specified legal state upon an SEU 2 1 0 1 0 0 1 1 0 Good State STATES (5): Illegal State: unmapped IDLE TURNON_A TURNOFF_A TURNON_B TURNOFF_B :000 :001 :010 :011 :100 Using the “Safe” attribute will not detect the SEU: This could cause detrimental behavior 2 1 0 0 0 1 0 1 1 Good State: TURNON_A legal State: TURNON_B Figure 6: SEU and Binary Encoding Response 5.2.2 One-Hot The One-Hot encoding requires only one bit to be turned on at a time, i.e. each state is mapped to one bit. One Hot state-machines require more registers than Binary or Gray thus ultimately have more “illegal states”. However, it takes two bits to flip in order to transition into a mapped state. This makes detecting errors within One-Hot easy. If implemented correctly, this can make transitioning from an unmapped state deterministic. 5.2.3 Binary vs. One Hot There are many debates over which technique is best for fault detection implementation: one-hot vs. sequential. The two styles differ such that if a bit has flipped (a fault) the binary value of the one-hot encoding will no longer be “One-Hot” (more than one bit will be a ‘1’ or all bits will be ‘0’). Thus the error detection lies within the binary representation. With Binary Encoding this is not true. The reasoning is that a flipped bit (a fault) can represent a “legal” state. Therefore, combinatorial error detection will not suffice – the logic needs memory to hold the parity of the good state. Some designers believe that the use of the extra parity bit (extra DFF within the state machine) can be beneficial for fault detection in sequential machines. This is a false statement. There are many scenarios that can be presented that show the implications of assuming a level of fault detection while implementing sequential encoding and parity. Most of them are based on the premise that multiple paths within the circuitry have different routing delays and if an SEU happens near a clock edge, unpredictable results can occur. The obvious problem is that an incorrect transition can occur without any detection. This is a phenomenon of a “hamming distance of one” encoding scheme such as the sequential state machine. Employing a hamming-1 encode is not a recommended practice while implementing fault detection and correction. 6 FAULT TOLERANT STATE MACHINE IMPLEMENTATION The main objective of a fault tolerant state machine is to be able to detect an error (a flipped bit in the current state registers) and have a deterministic response within a deterministic time frame. The definition of the response and its response time is dependent on the design requirements. For example, the response may be as simple as indicating that an error occurred and waiting for a signal to bring the state machine to a known “working” state… or the response may be as complex as automatic correction of the error within one clock cycle. 6.1 State Machines and the Bucket Approach Most designers believe that they are implementing a bucket-approach to fault tolerance when using the default clause (VHDL- when others) within a CASE statement. However, the synthesis tools ignore these statements when synthesizing state machines. Most designers are not aware that these statements (that suggest “bucketing” unused states and supplying a transition out of a fault condition) have no bearing on the actual gates being created. The reasoning behind this is that if the synthesis tool were to implement all unused states, the required area would be excessive. Synthesis tools have a directive that the designer can use called “safe”. When applying this directive to a state machine, the tool can produce a bucket of illegal states. However, the designer must be forewarned that using the “safe” directive has many flaws: 1. It generally creates more gates than desired, i.e. buckets of illegal states can become very large. 2. Synplify will implement a “safe” one-hot but Leonardo and Precision will only implement a sequential state machine (binary or Gray encoding) while using the “safe” directive. 6.2 How Safe is the “SAFE” Option? Recently, there has been a surge of designers using the “safe” options offered by several different synthesis tools. They have defined a safe state machine as one that will always transition to a known state - i.e., if an SEU occurs and an illegal or unmapped state is reached, one will recover to a known state. Synthesis tools offer a “Safe” option (demand from the Aerospace industry): TYPE states IS ( IDLE, GET_DATA, PROCESS_DATA, SEND_DATA, BAD_DATA ); SIGNAL current_state, next_state : states; attribute SAFE_FSM: Boolean; attribute SAFE_FSM of states: type is true; Although this scheme may appear to be “fool-proof”, it is not. The “safe” option will not always transition to a reliable (or known) state. Some have defined the word “known” as mapped. However, the idea is to be deterministic. The idea is to always flip into some “ERROR” or “IDLE” state upon getting an SEU. Flipping into a mapped state, in which the rest of the system may not know the state machine is there, is not deterministic. It is not good enough to transition to a mapped state:  The machine can get stuck waiting for an input that will never occur (rest of the system doesn’t know that you have flipped into this mapped state).  Unexpected signals can turn on (or off).  And many more reasons… The compilers are geared towards implementing the “safe” option with Binary Encoding. In such a case, there exists a false sense of safety. For example: if an SEU transitions the state machine (i.e. one of the registers gets hit and flips) into a mapped (legal) state (however it is an illegal transition to that state) an extreme fault can occur – there can be severe output behavior with no error detection. Most importantly, note that in a Binary Encoded State Machine there will always be a higher probability of flipping into a mapped state (“good”) state, than there is flipping into an unmapped state. Unfortunately, Binary Encoded state machines can be faulty. The designer must consider the list of potential problems before using the safe directive as the method for fault detection. An alternate approach beyond using the synthesis “safe” directive should be taken to ensure trust worthy fault tolerance. This paper will first address single-bit error detection techniques. Afterwards, a robust single-bit error correction scheme along with a new encoding technique will be presented. 6.3 Single-Bit Error Detection and Recovery This section will focus on detecting that an error has occurred within one of the state machine DFFs and then recovering from the fault. Remember that it is necessary for the designer to manually provide recovery logic. 6.3.1 One-Hot The beauty of the one hot encoding style stems from the fact that each state has a hamming distance of 2 (it takes 2 transitions to get from one state to another) thereby it inherently has SEU error detection. During normal operation of a one hot state machine, only one bit is turned on – indicating which state is current. Thus the current state register should always have an odd parity. Each transition requires 2 bits to flip (one bit turns off while one bit turns on). If there is an SEU within the state machine, only one bit will flip. The parity will then switch to even. Example: Using the “five state” example, assume the circuit is in GET_DATA (“00010” - odd). AN SEU can cause the state machine to have the encoded pattern of “00000”. However, “00000” is not mapped to any state and is easily detected because it has even parity. It is thus considered an undefined or illegal state for the state machine. The same theory can be used if the state machine was in state GET_DATA and the current state flip flops changed to: “10010”. 6.3.2 Using Multiplexer Control for Next State Transitions The most efficient means of SEU error detection and recovery in one-hot state machines is the use of a combinatorial parity checker over the one-hot current state register set. The designer would use the output of the error detection logic to either place the state machine into its normal operational next state at the following clock edge or to set the machine into a designated error state (could also be as simple as going back to a reset state) upon detecting an SEU. The fact that the error detection is purely combinatorial logic is a plus because it contains no DFF’s and thus the detection logic cannot create a fault of its own. Clock Error State Pattern Inputs Q CLR Q Q CLR D SET Q D SET Outputs Metastability filter to synchronize SEU Current State MUX Next State XNOR combinational logic Figure 7: SEU Error Detection within One Hot Encoding Note: The user must declare the state machine (using attributes) as one-hot to ensure that the encoding is as expected. 6.4 Error Correction Notice that this section discusses Correction which is distinctively different than detection. There are many publications on the theory of Error Correction. However, none directly address how to correctly implement state machine fault correction while using current day synthesis tools. As mentioned earlier, synthesis tools focus on area reduction and speed – the tools will want to “optimize” away the necessary redundant logic for fault tolerance. Thus the designer is left to manually create the desired fault tolerant logic. 6.4.1 Error Correction Implementation using Hamming 3 Encoding One method of correction is based off of creating a substantial hamming distance between state encodings. Each state has a base encoding plus companion encodings. The companions are every sequence that can be a hamming distance of one away (only one bit is different) from the base state encoding (thus representing a possible SEU). All of the base encodings must be a hamming distance of three away from each other. The following is an example of a four state implementation – States A,B,C,D: 1. Find an encoding such that the states have a hamming distance of 3 (at least 3 bits must be different from state to state)...  00000 (state-A)  11100(state-B)  01111(state-C)  10011(state-D) Five bits are necessary to encode a four-state machine in order to achieve the required hamming distance of three. 2. For each encoding, calculate the companion encodings such that the hamming distance is one... for example - the companion encodings for state A (00000) is: 00001,00010,00100,01000,10000 (only one bit differs from state A for each companion state); B(11100) is 11101,11110,11001,10100,01100 ... and so on... 3. Now when implementing the state machine, state A is actually 00000 plus its companion encodings so that a possible SEU is covered (do the same for all other states). 6.4.2 Error Correction and Glitch Control One major issue that is extremely overlooked is SEUs occurring near clock edges. When implementing fault tolerance, the designer must be aware of all scenarios. If the designer does not carefully code the state machine in HDL, glitches (static hazards) can occur. It is left to the designer to shut off the synthesis optimization in order to ensure glitch-free state machine transitions. In a synchronous design, glitches are generally not a problem as long as all critical paths meet timing, outputs are registered, and any asynchronous signals are synchronized to the system clock. However, an SEU can happen at any time and create dangerous asynchronous behavior – i.e., a glitch near a clock edge. In order to safe guard against glitchy circuitry, the designer must understand how glitches are formed. Referencing a Karnaugh-Map, prime implicants are formed by grouping together adjacent, equivalent values. When the initial and final inputs are covered by the same prime implicant, no glitch can occur. However, when the input change spans prime implicants, a glitch can occur. The beauty of the proposed method of error correction is that hazard-free prime implicants (SOP – Sum of Products) can be easily derived and hand coded because all companion states are a hamming distance of one away from the base state. For simplicity, the K-Map presented in Figure 3 only shows 4 dimensions. State(0) 00 00 1 01 1 01 11 1 10 1 State(3) 11 State(2) 10 1 State(1) StateA companion states SOP (including State(4) dimension): State(0)State(1)State(2)State(3) + State(0)State(1)State(2)State(4) + State(0)State(1)State(3)State(4) + State(0)State(2)State(3)State(4) + State(1)State(2)State(3)State(4) Figure 8: Karnaugh Map and Sum of Products (SOP) for StateA and its Companion States As one can see from the K-Map in Figure 8, if the current state is “00000” and an SEU occurs, there will be a glitch-free transition into one of the companion states as long as the SOP contains the prime implicants circled in the K-Map. 7 CONCLUSIONS The usage of state machines in order to efficiently control mechanisms within a design has become very popular. This paper proposes methods of Fault Tolerant State Machine implementation due to potential IC SEU susceptibility. Before choosing any particular technique, the designer should first be aware of the probability of electrical faults and then decide the appropriate level of coverage necessary to achieve a robust design. Special directives must be used in order to drive the synthesis tools when implementing fault tolerant redundant logic because the tools are generally focused on area and speed optimization. Thus, once the gates are produced, the designer should check that no functionality has been “optimized” away and that the appropriate state machine has been realized. 8 REFERENCES [1] Douglas Smith, HDL Chip Design, Doone Publications, Madison, AL, USA, 1996 [2] Parag K. Lala, Self-Checking and FaultTolerant Digital Design, Academic Press, 2001 [3] Randy H. Katz, Contemporary Logic Design, Benjamin/Cummings Publishing Company, Inc., 1994 [4] PrecisionTM Synthesis Reference Manual, Mentor Graphics Corp., Release 2003b.

A Simplified Approach to Fault Tolerant State Machine

Related documents

Products

Support

A Simplified Approach to Fault Tolerant State Machine

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib