A Simplified Approach to Fault Tolerant State Machine
Design for Single Event Upsets
Melanie Berg
Principal Design Engineer, Ball Aerospace & Technologies Corp.
1 ABSTRACT
As Integrated Circuit (IC) geometries become smaller and core voltages scale down, the
probability of incurring system faults increases significantly. Errors occur when charged
particles penetrate a memory cell and cross a junction, creating an aberrant charge that
changes the state of the bit. Due to the complexity and sensitivity of large designs, fault
tolerant schemes are evolving at the gate level.
This paper will address Single Event Upsets (SEUs) within edge-triggered D-Flip-Flops
(DFFs) and assumes that the upsets are soft (correctable by the following clock edge … if
the DFF is enabled). Solely considering single events is a fair assessment due to the low
probability of having multiple errors occur within one clock cycle. Due to the radiation
effects in space, the Aerospace industry has always had to design with SEU
consideration. For gate-level DFF protection, Triple Mode Redundant
(TMR, voting) logic is the most commonly used scheme to combat SEUs. However,
TMR can be very area intensive and, in a turbulent environment, may not fully eliminate
the probability of upsets. As a solution, many error-coding techniques have been
proposed as a complement (or replacement) to TMR; however, due to their complexity,
they are rarely implemented.
A simplified approach to fault tolerant state machine design, starting from architectural
development through synthesis, will also be discussed. Examples of coding schemes that
include additional logic for error detection and (in some cases) correction, such as One
Hot, Sequential, and Hamming, will be examined. Because users have run into
roadblocks with synthesis tools “optimizing” away logic that is necessary for error handling,
special attention will be given to the synthesis techniques needed to produce the correct
realization of functionality.
2 FAULT TOLERANCE
The definition of Fault Tolerance is the ability to mask or recover from erroneous
conditions in a system once an error has been detected.
2.1 Determining the Level of Fault Tolerance for Your System
The degree of fault tolerance to implement is defined by your system-level
requirements, i.e. specifications that clearly state acceptable behavior upon error. The
following are some questions that must be answered within the system requirements
documentation:
• Does your system only need to detect an error?
• How quickly must the system respond to an error?
• Must your system also correct the error?
• Is the system susceptible to more than one error per clock cycle?
2.2 Common Approach: TMR (Triple Mode Redundancy)
[Figure 1 diagram: three redundant D flip-flops whose Q outputs feed a common Voting Logic block.]
Figure 1: TMR Implementation Diagram
Triple Mode Redundancy (TMR) is the most commonly implemented solution for SEU
tolerance because it is very simple. Nevertheless, caution is needed when
implementing mitigation. It is important that the voting logic is glitch free. Otherwise,
a glitch in the TMR voting logic (which can arise from mitigation across separate clock
domains or from hazardous combinational logic) can be captured by a clock edge. In low
frequency designs, the probability of mitigation glitches being captured by clock edges is
low. However, that probability increases with frequency. The probability of
glitch capture depends on clock speed, fan-out, and the width of the glitch (set by the speed
of the gate logic within the voting circuitry).
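As an illustration of what the voting logic computes, the following is a minimal behavioral sketch in Python (not HDL). The function name `majority_vote` and the 4-bit register value are assumptions made for this example. It models only the steady-state result of the vote; it does not model gate delays, so it cannot show the glitches discussed above.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three redundant register copies."""
    # All three pairwise product terms are kept: a&b, b&c, a&c.
    return (a & b) | (b & c) | (a & c)


# Three copies of a hypothetical 4-bit register; one copy is hit by an SEU.
good = 0b1010
upset = good ^ 0b0100          # bit 2 flipped in the third copy
voted = majority_vote(good, good, upset)
print(format(voted, "04b"))
```

With a single upset copy, the voted value equals the two agreeing copies. In hardware, whether the voter output stays stable while an input changes depends on how the tool realizes these product terms, which is why the synthesized netlist should be inspected.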
The following figures demonstrate a poor implementation of TMR using a 32-bit counter.
A counter was chosen for this example due to the large fanout of the circuitry.
[Figure 2 diagram: signals A, B, and C feed the TMR Circuit, whose output OutSig drives input E
of a 32-bit Counter clocked by sysclk, with Reset. Annotations: For this example, C will be hit
by an SEU; the TMR logic should stay stable. However, poor TMR circuitry was synthesized and a
glitch occurs on OutSig. If OutSig glitches near a clock edge, unpredictable results occur
within the counter.]
Figure 2: 32 Bit counter with TMR
Figure 3: Waveform Demonstrating the Effects of Glitches in the TMR Logic
3 SYNCHRONOUS DESIGN WITH ASYNCHRONOUS EVENTS
It is important to note that SEUs are asynchronous. It is impossible to control when (or
where) the event will occur. This point is very often overlooked.
Because this paper focuses on sequential Single Event Upsets (SEUs) within a
synchronous design environment, it is important to discuss the effects of an asynchronous
event (to which every internal DFF is susceptible) within such an
environment. If the event occurs well within the clock period, then there may be enough
time to deterministically correct the error (depending on the scheme and clock speed).
However, if the SEU occurs near a clock edge, metastability (or unpredictable events)
can occur. Current mitigation schemes do not take this into account, which decreases
the anticipated reliability.
4 MOTIVATION FOR STATE MACHINE FAULT TOLERANCE
DFFs play an important role within synchronous designs. They are contained within one
clock domain and act as deterministic timing boundaries. Such a design strategy
(synchronous techniques) increases the verification coverage of a circuit and enhances
production.
Synchronous state machines use DFFs to hold their current state value. State machines are
generally used as controllers and are at the heart of most designs. If a synchronous state
machine is not designed to accommodate Single Event Upsets, the circuit can become
locked or produce unpredictable behavior until a reset is generated. Unfortunately,
waiting for a system reset may not be a suitable solution. Thus, designers should be
aware of techniques for error detection and perhaps correction specifically for erroneous
state conditions.
It is important to note that current synthesis tools are specifically geared towards area and
timing optimization. Their algorithms will want to “erase” redundant schemes for fault
tolerance that the designer places within the HDL code. Therefore, the designer must also
be familiar with the synthesis package of choice and apply the necessary directives for
preserving redundant logic.
5 SYNCHRONOUS STATE MACHINE IMPLEMENTATION
Generally, state machines are utilized as controlling mechanisms within a design. They
determine when signals should turn on or off, when to implement (or stop implementing)
a function, how long to wait for events, etc. However, state machines are not necessary
to implement such functionality. In fact, using state machines tends to
increase the total number of gates within a design. In addition, if a state machine
operates in a disruptive environment (such as a space mission), the entire system
can lock up in an unreachable condition. So why use state machines? State machines
afford the designer an easy-to-follow design methodology, manageable
design reviews, and a hook into a systematic verification process. They also reduce the
propensity to create “spaghetti” (out of control) designs.
5.1 Structure
A synchronous state machine is designed to deterministically transition through a pattern
of defined states. A state is represented by a register (set of DFFs) and is referred to as
the “current state.”
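[Figure 4 diagram: INPUTS and the CURRENT STATE feed the NEXT STATE logic; the CURRENT STATE
register is driven by CLOCK and feeds the OUTPUTS logic.]
Figure 4: Traditional State Machine Logic Diagram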
The structure consists of four parts:
1. Inputs: All inputs to a “synchronous” state machine must be synchronous. Inputs are
used to determine next state transitions.
2. Next state logic: Combinatorial logic used to generate the next state for the
machine. The next state value is a function of the state machine’s inputs and its
current state.
3. Current State Register: An n-bit register of DFFs used to hold the current state of the
machine. Synchronous state machines change state only on a clock edge.
4. Output logic: The output can be purely combinatorial or registered (generally it is
preferable to register the logic). The outputs can be a function of the next state and/or
current state (and perhaps the direct inputs) of the machine.
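The four parts listed above can be sketched as a small behavioral model in Python. The state names are borrowed from the five-state example used later in the paper. The transitions, the inputs `data_ready` and `data_ok`, and the outputs `busy` and `error` are hypothetical and chosen only to make the sketch complete.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    GET_DATA = 1
    PROCESS_DATA = 2
    BAD_DATA = 3
    SEND_DATA = 4


def next_state_logic(current: State, data_ready: bool, data_ok: bool) -> State:
    """Part 2: the next state is a function of the inputs and the current state."""
    if current is State.IDLE:
        return State.GET_DATA if data_ready else State.IDLE
    if current is State.GET_DATA:
        return State.PROCESS_DATA
    if current is State.PROCESS_DATA:
        return State.SEND_DATA if data_ok else State.BAD_DATA
    return State.IDLE  # BAD_DATA and SEND_DATA both return to IDLE (assumption)


def output_logic(current: State) -> dict:
    """Part 4: outputs derived from the current state only."""
    return {"busy": current is not State.IDLE, "error": current is State.BAD_DATA}


class StateMachine:
    def __init__(self) -> None:
        self.current = State.IDLE  # Part 3: current state register at reset

    def clock_edge(self, data_ready: bool, data_ok: bool) -> dict:
        """The register is updated only here, which models the clock edge."""
        self.current = next_state_logic(self.current, data_ready, data_ok)
        return output_logic(self.current)


fsm = StateMachine()
trace = []
for ready, ok in [(True, False), (False, False), (False, False), (False, False)]:
    fsm.clock_edge(ready, ok)
    trace.append(fsm.current.name)
print(trace)
```

The inputs (part 1) are passed into `clock_edge`, which mirrors the requirement that they be sampled synchronously with the state register.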
5.2 Encoding Schemes
Each state of a state machine must be mapped into some type of encoding (pattern of
bits). Once the state has been mapped, it is then considered a defined (legal) state.
Depending on the encoding scheme and the number of states within the design, there can
be “unused” encoded patterns – i.e. there is no defined mapping from the encoded pattern
to a state. In such a case, if there is a fault within the circuitry and the state machine
jumps into one of these unmapped (illegal) states, then the circuit has the potential to
become locked (or stuck) in some unreachable condition (undefined state).
Currently, the most popular encoding schemes used by designers are One Hot, Binary, and
Gray. The reader should be aware that there are many other schemes; however, this paper
will only address Binary and One Hot, plus a special encoding technique specifically for
correction.
Registers: binary encoding
STATES (5):
IDLE         :000
GET_DATA     :001
PROCESS_DATA :010
BAD_DATA     :011
SEND_DATA    :100
Good state: 100 (SEND_DATA)
Bad state: 110 (unmapped)

Registers: One Hot encoding
STATES (5):
IDLE         :00001
GET_DATA     :00010
PROCESS_DATA :00100
BAD_DATA     :01000
SEND_DATA    :10000
Good state: 10000 (SEND_DATA)
Bad state: 11000 (unmapped)
Figure 5: Binary and One Hot Encoding Schemes - Mapped States vs. Unmapped States
5.2.1 Binary
The Binary Encoding technique maps states into a base 2 counting scheme.
$\lceil \log_2(N_{States}) \rceil$ registers are required, i.e. for 5 states ($N_{States}$),
3 registers ($N_{Reg}$) are required. There will always be $2^{N_{Reg}} - N_{States}$ unmapped states.
Because $2^{N_{Reg}} - N_{States}$ will always be less than $N_{States}$ (a property of binary
encoding), the probability of flipping into a mapped state is higher than that of flipping into an
unmapped state (the IDLE or error state was not included in this deduction). Plainly
speaking, it is more likely to jump into “any” of the mapped states than into the IDLE state
(the word “any” is what makes this problem non-deterministic).
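The counting argument above can be checked by enumeration. The sketch below is a Python model written for this purpose; the helper names `registers_needed` and `flip_outcomes` are assumptions. It counts, for every state of a binary-encoded machine and every possible single-bit upset, whether the result is a mapped or an unmapped pattern.

```python
def registers_needed(n_states: int) -> int:
    """Number of DFFs for binary encoding: ceil(log2(n_states)), at least 1."""
    return max(1, (n_states - 1).bit_length())


def flip_outcomes(n_states: int) -> tuple:
    """Count single-bit flips that land in mapped vs. unmapped patterns."""
    n_reg = registers_needed(n_states)
    mapped = unmapped = 0
    for state in range(n_states):          # states are encoded 0 .. n_states-1
        for bit in range(n_reg):
            if state ^ (1 << bit) < n_states:
                mapped += 1
            else:
                unmapped += 1
    return mapped, unmapped


n_reg = registers_needed(5)
print(n_reg, 2 ** n_reg - 5, flip_outcomes(5))
```

For the five-state example, 10 of the 15 possible single-bit upsets land in a mapped state and only 5 land in one of the 3 unmapped patterns, which is consistent with the claim above.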
STATES (5):
IDLE      :000
TURNON_A  :001
TURNOFF_A :010
TURNON_B  :011
TURNOFF_B :100
State(1) flips upon SEU (bit order 2, 1, 0):
Case 1: Good state 100 becomes illegal (unmapped) state 110. Using the “Safe” attribute will
transition the user to a specified legal state upon an SEU.
Case 2: Good state 001 (TURNON_A) becomes legal state 011 (TURNON_B). Using the “Safe”
attribute will not detect the SEU. This could cause detrimental behavior.
Figure 6: SEU and Binary Encoding Response
5.2.2 One-Hot
One-Hot encoding requires only one bit to be turned on at a time, i.e. each state is
mapped to one bit. One-Hot state machines require more registers than Binary or Gray
and thus ultimately have more “illegal states”. However, two bits must flip in order to
transition from one mapped state into another. This makes detecting errors within One-Hot easy. If
implemented correctly, this can make transitioning out of an unmapped state deterministic.
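The property that a single upset can never move a one-hot machine into another mapped state can also be confirmed by enumeration. This is a minimal Python sketch for the five-state example; the helper names are assumptions.

```python
def one_hot_codes(n_states: int) -> list:
    """One bit per state: 00001, 00010, 00100, ..."""
    return [1 << i for i in range(n_states)]


def single_bit_flips(code: int, width: int) -> list:
    """Every pattern reachable from `code` by one upset."""
    return [code ^ (1 << bit) for bit in range(width)]


N_STATES = 5
codes = one_hot_codes(N_STATES)

# Collect any upset that would land on another mapped (legal) code.
landed_in_mapped = [
    (code, flipped)
    for code in codes
    for flipped in single_bit_flips(code, N_STATES)
    if flipped in codes
]
unmapped_patterns = 2 ** N_STATES - len(codes)
print(landed_in_mapped, unmapped_patterns)
```

The list of upsets that reach a mapped state is empty, while 27 of the 32 patterns are unmapped. This is the trade described above: more illegal patterns, but every single-bit upset is visible.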
5.2.3 Binary vs. One Hot
There are many debates over which technique is best for fault detection:
one-hot vs. sequential (binary). The two styles differ in that if a bit has flipped (a fault), a
one-hot encoded value will no longer be “One-Hot” (either more than one bit will
be ‘1’ or all bits will be ‘0’). Thus the error can be detected from the encoded value
itself.
With Binary Encoding this is not true, because a flipped bit (a fault) can
produce another “legal” state. Therefore, combinatorial error detection will not suffice; the
logic needs memory to hold the parity of the good state.
Some designers believe that an extra parity bit (an extra DFF within the state
machine) can be beneficial for fault detection in sequential machines. This belief is
false. Many scenarios can be presented that show the implications of
assuming a level of fault detection while implementing sequential encoding and parity.
Most of them are based on the premise that multiple paths within the circuitry have
different routing delays, so if an SEU happens near a clock edge, unpredictable results
can occur. The obvious problem is that an incorrect transition can occur without any
detection. This is a phenomenon of any encoding scheme with a Hamming distance of one, such
as the sequential state machine. Employing a Hamming-distance-1 encoding is not recommended
when implementing fault detection and correction.
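The difference between the two styles can be expressed as the minimum Hamming distance between any two mapped codes. The Python sketch below computes it for the five-state encodings of Figure 5; the function names are assumptions.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def min_distance(codes: list) -> int:
    """Smallest Hamming distance between any two distinct codes."""
    return min(hamming(a, b) for a, b in combinations(codes, 2))


binary_codes = [0b000, 0b001, 0b010, 0b011, 0b100]
one_hot_codes = [0b00001, 0b00010, 0b00100, 0b01000, 0b10000]

print(min_distance(binary_codes), min_distance(one_hot_codes))
```

A minimum distance of 1 means a single upset can turn one legal code into another, which is the undetected transition described above. A minimum distance of 2 means a single upset always produces an illegal code.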
6 FAULT TOLERANT STATE MACHINE IMPLEMENTATION
The main objective of a fault tolerant state machine is to be able to detect an error (a
flipped bit in the current state registers) and have a deterministic response within a
deterministic time frame. The definition of the response and its response time is
dependent on the design requirements. For example, the response may be as simple as
indicating that an error occurred and waiting for a signal to bring the state machine to a
known “working” state… or the response may be as complex as automatic correction of
the error within one clock cycle.
6.1 State Machines and the Bucket Approach
Most designers believe that they are implementing a bucket-approach to fault tolerance
when using the default clause (VHDL- when others) within a CASE statement. However,
the synthesis tools ignore these statements when synthesizing state machines. Most
designers are not aware that these statements (that suggest “bucketing” unused states and
supplying a transition out of a fault condition) have no bearing on the actual gates being
created. The reasoning behind this is that if the synthesis tool were to implement all
unused states, the required area would be excessive.
Synthesis tools have a directive that the designer can use called “safe”. When applying
this directive to a state machine, the tool can produce a bucket of illegal states. However,
the designer must be forewarned that using the “safe” directive has many flaws:
1. It generally creates more gates than desired, i.e. buckets of illegal states can become
very large.
2. Synplify will implement a “safe” one-hot but Leonardo and Precision will only
implement a sequential state machine (binary or Gray encoding) while using the “safe”
directive.
6.2 How Safe is the “SAFE” Option?
Recently, there has been a surge of designers using the “safe” options offered by several
different synthesis tools. They have defined a safe state machine as one that will always
transition to a known state - i.e., if an SEU occurs and an illegal or unmapped state is
reached, one will recover to a known state.
Synthesis tools offer a “Safe” option (demand from the Aerospace industry):
    TYPE states IS (IDLE, GET_DATA, PROCESS_DATA,
                    SEND_DATA, BAD_DATA);
    SIGNAL current_state, next_state : states;
    -- Request a "safe" implementation for state machines of this type
    ATTRIBUTE SAFE_FSM : boolean;
    ATTRIBUTE SAFE_FSM OF states : TYPE IS true;
Although this scheme may appear to be “fool-proof”, it is not. The “safe” option will not
always transition to a reliable (or known) state. Some have defined the word “known” as
mapped. However, the idea is to be deterministic. The idea is to always flip into some
“ERROR” or “IDLE” state upon getting an SEU. Flipping into a mapped state, in which
the rest of the system may not know the state machine is there, is not deterministic. It is
not good enough to transition to a mapped state:
• The machine can get stuck waiting for an input that will never occur (the rest of the
system doesn’t know that you have flipped into this mapped state).
• Unexpected signals can turn on (or off).
• And many more reasons…
The compilers are geared towards implementing the “safe” option with Binary Encoding.
In such a case, there exists a false sense of safety. For example, if an SEU hits one of the
registers and flips the state machine into a mapped (legal) state
by way of an illegal transition, an extreme fault can occur: there can be
severe output behavior with no error detection.
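The two cases of Figure 6 can be reproduced with a small Python model of the “safe” behavior. This is a sketch under stated assumptions: it treats “safe” as forcing any unmapped pattern to a recovery state, and it assumes that the recovery state is IDLE. The names `safe_response` and `apply_seu` are hypothetical and do not correspond to any tool's API.

```python
STATES = {
    0b000: "IDLE",
    0b001: "TURNON_A",
    0b010: "TURNOFF_A",
    0b011: "TURNON_B",
    0b100: "TURNOFF_B",
}
RECOVERY_STATE = 0b000  # assumption: the "safe" logic recovers to IDLE


def apply_seu(state: int, bit: int) -> int:
    """Flip one bit of the current state register."""
    return state ^ (1 << bit)


def safe_response(state: int) -> tuple:
    """Return (resulting state, detected) for the modeled 'safe' logic."""
    if state in STATES:
        return state, False          # mapped pattern: nothing to detect
    return RECOVERY_STATE, True      # unmapped pattern: forced to recovery


case_1 = safe_response(apply_seu(0b100, 1))  # 100 -> 110, unmapped
case_2 = safe_response(apply_seu(0b001, 1))  # 001 -> 011, TURNON_B
print(case_1, case_2)
```

In the first case the upset is caught and the machine recovers. In the second case the machine silently moves from TURNON_A to TURNON_B, which is the undetected transition that makes the binary “safe” option unreliable.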
Most importantly, note that in a Binary Encoded State
Machine there will always be a higher probability of
flipping into a mapped (“good”) state than of
flipping into an unmapped state.
Unfortunately, Binary Encoded state machines can be faulty. The designer must consider
the list of potential problems before using the safe directive as the method for fault
detection.
An alternate approach beyond using the synthesis “safe” directive should be taken to
ensure trustworthy fault tolerance. This paper will first address single-bit error detection
techniques. Afterwards, a robust single-bit error correction scheme along with a new
encoding technique will be presented.
6.3 Single-Bit Error Detection and Recovery
This section will focus on detecting that an error has occurred within one of the state
machine DFFs and then recovering from the fault. Remember that it is necessary for the
designer to manually provide recovery logic.
6.3.1 One-Hot
The beauty of the one-hot encoding style stems from the fact that any two states are a
Hamming distance of 2 apart (two bits must flip to get from one state to another), so it
inherently provides SEU error detection. During normal operation of a one-hot state machine,
only one bit is turned on, indicating which state is current. Thus the current state register
should always have odd parity. Each transition requires 2 bits to flip (one bit turns off
while one bit turns on). If there is an SEU within the state machine, only one bit will flip.
The parity will then switch to even.
Example: Using the “five state” example, assume the circuit is in GET_DATA (“00010”,
odd). An SEU can cause the state machine to have the encoded pattern “00000”.
However, “00000” is not mapped to any state and is easily detected because it has even
parity. It is thus considered an undefined or illegal state for the state machine. The same
reasoning applies if the state machine was in state GET_DATA and the current state
flip-flops changed to “10010”.
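The parity argument in the example above can be checked directly. This is a minimal Python sketch; the function name `has_odd_parity` is an assumption.

```python
def has_odd_parity(bits: str) -> bool:
    """True when the pattern contains an odd number of '1' bits."""
    return bits.count("1") % 2 == 1


# GET_DATA, then the two upset patterns from the example.
for pattern in ("00010", "00000", "10010"):
    status = "legal parity" if has_odd_parity(pattern) else "SEU detected"
    print(pattern, status)
```

Only “00010” has odd parity. Both upset patterns have even parity and are therefore flagged.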
6.3.2 Using Multiplexer Control for Next State Transitions
The most efficient means of SEU error detection and recovery in one-hot state machines
is the use of a combinatorial parity checker over the one-hot current state register set. The
designer would use the output of the error detection logic to either place the state
machine into its normal operational next state at the following clock edge or to set the
machine into a designated error state (which could be as simple as going back to a reset
state) upon detecting an SEU. The fact that the error detection is purely combinatorial
logic is a plus because it contains no DFFs, and thus the detection logic cannot suffer a
DFF upset of its own.
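[Figure 7 diagram: XNOR logic over the Current State register detects the SEU; the result passes
through a metastability filter (DFFs) that synchronizes the SEU indication and controls a MUX that
selects between the Next State combinational logic and the Error State Pattern. Clock, Inputs, and
Outputs are as labeled.]
Figure 7: SEU Error Detection within One Hot Encoding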
Note: The user must declare the state machine (using attributes) as one-hot to ensure that
the encoding is as expected.
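The multiplexer selection described in this section can be sketched as follows. This is a behavioral Python model, not the circuit of Figure 7: the metastability filter is not modeled, and the choice of IDLE (“00001”) as the error state pattern is an assumption made for the example.

```python
WIDTH = 5
ERROR_STATE = 0b00001  # assumption: recover to IDLE on a detected SEU


def parity_ok(current: int) -> bool:
    """Combinatorial check: a one-hot register must have odd parity."""
    return bin(current & ((1 << WIDTH) - 1)).count("1") % 2 == 1


def mux_next_state(current: int, normal_next: int) -> int:
    """Select the normal next state, or the error state if parity fails."""
    return normal_next if parity_ok(current) else ERROR_STATE


GET_DATA, PROCESS_DATA = 0b00010, 0b00100

no_fault = mux_next_state(GET_DATA, PROCESS_DATA)
after_seu = mux_next_state(GET_DATA ^ 0b10000, PROCESS_DATA)
print(format(no_fault, "05b"), format(after_seu, "05b"))
```

Without a fault the machine proceeds to PROCESS_DATA. After an upset the multiplexer loads the error state pattern at the next clock edge, which gives the deterministic response the section calls for.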
6.4 Error Correction
Notice that this section discusses correction, which is distinctly different from
detection. There are many publications on the theory of Error Correction. However, none
directly address how to correctly implement state machine fault correction while using
current synthesis tools. As mentioned earlier, synthesis tools focus on area reduction
and speed, so the tools will want to “optimize” away the redundant logic needed for
fault tolerance. Thus the designer is left to manually create the desired fault tolerant logic.
6.4.1 Error Correction Implementation using Hamming 3 Encoding
One method of correction is based on creating a substantial Hamming distance
between state encodings. Each state has a base encoding plus companion encodings.
The companions are every sequence that is a Hamming distance of one away (only
one bit is different) from the base state encoding (thus representing a possible SEU). All
of the base encodings must be a Hamming distance of at least three away from each other. The
following is an example of a four state implementation (States A, B, C, D):
1. Find an encoding such that the states have a Hamming distance of 3 (at least
3 bits must be different from state to state):
• 00000 (state A)
• 11100 (state B)
• 01111 (state C)
• 10011 (state D)
Five bits are necessary to encode a four-state machine in order to achieve the required
Hamming distance of three.
2. For each encoding, calculate the companion encodings such that the Hamming distance
is one. For example, the companion encodings for state A (00000) are
00001, 00010, 00100, 01000, 10000 (only one bit differs from state A for each companion
state); for B (11100) they are 11101, 11110, 11000, 10100, 01100; and so on.
3. Now when implementing the state machine, state A is actually 00000 plus its
companion encodings so that a possible SEU is covered (do the same for all other states).
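The three steps above can be sketched in Python. The base encodings are the ones given in step 1. The names `companions`, `DECODE`, and `decode` are assumptions made for this illustration.

```python
from itertools import combinations

WIDTH = 5
BASE = {"A": 0b00000, "B": 0b11100, "C": 0b01111, "D": 0b10011}


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def companions(code: int) -> list:
    """Step 2: every pattern exactly one bit away from the base encoding."""
    return [code ^ (1 << bit) for bit in range(WIDTH)]


# Step 1: the base encodings must be at least 3 bits apart from each other.
min_base_distance = min(
    hamming(a, b) for a, b in combinations(BASE.values(), 2)
)

# Step 3: each state is its base encoding plus all of its companions.
DECODE = {}
for name, code in BASE.items():
    for pattern in [code] + companions(code):
        assert pattern not in DECODE  # no pattern may belong to two states
        DECODE[pattern] = name


def decode(pattern: int):
    """Return the state name for a pattern, or None if it is unmapped."""
    return DECODE.get(pattern)


print(min_base_distance, len(DECODE), decode(0b11000))
```

The four states occupy 24 of the 32 possible patterns, so a single upset in any state still decodes to that state. The remaining 8 patterns are unmapped and would only be reached by a multi-bit upset.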
6.4.2 Error Correction and Glitch Control
One major issue that is frequently overlooked is SEUs occurring near clock edges. When
implementing fault tolerance, the designer must be aware of all scenarios. If the designer
does not carefully code the state machine in HDL, glitches (static hazards) can occur. It is
left to the designer to shut off the synthesis optimization in order to ensure glitch-free
state machine transitions.
In a synchronous design, glitches are generally not a problem as long as all critical paths
meet timing, outputs are registered, and any asynchronous signals are synchronized to the
system clock. However, an SEU can happen at any time and create dangerous
asynchronous behavior, i.e. a glitch near a clock edge. In order to safeguard against
glitchy circuitry, the designer must understand how glitches are formed.
Referencing a Karnaugh-Map, prime implicants are formed by grouping together
adjacent, equivalent values. When the initial and final inputs are covered by the same
prime implicant, no glitch can occur. However, when the input change spans prime
implicants, a glitch can occur. The beauty of the proposed method of error correction is
that hazard-free prime implicants (SOP, Sum of Products) can be easily derived and
hand coded because all companion states are a Hamming distance of one away from the
base state. For simplicity, the K-Map presented in Figure 8 only shows 4 dimensions.
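[Figure 8 K-Map: a 4-variable map over State(0), State(1), State(2), and State(3) in which the
five cells for StateA (0000) and its four visible companion states (0001, 0010, 0100, 1000)
contain 1.]
StateA companion states SOP (including State(4) dimension):
$$\overline{State(0)}\,\overline{State(1)}\,\overline{State(2)}\,\overline{State(3)}
+ \overline{State(0)}\,\overline{State(1)}\,\overline{State(2)}\,\overline{State(4)}
+ \overline{State(0)}\,\overline{State(1)}\,\overline{State(3)}\,\overline{State(4)}
+ \overline{State(0)}\,\overline{State(2)}\,\overline{State(3)}\,\overline{State(4)}
+ \overline{State(1)}\,\overline{State(2)}\,\overline{State(3)}\,\overline{State(4)}$$
Figure 8: Karnaugh Map and Sum of Products (SOP) for StateA and its Companion States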
As one can see from the K-Map in Figure 8, if the current state is “00000” and an SEU
occurs, there will be a glitch-free transition into one of the companion states as long as
the SOP contains the prime implicants circled in the K-Map.
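The covering condition can be checked exhaustively for StateA. The Python sketch below assumes that each product term of the SOP requires four of the five state bits to be 0, which is the form that recognizes “00000” and its companions. The names `TERMS`, `state_a_sop`, and `covered_by_one_term` are hypothetical.

```python
from itertools import combinations

WIDTH = 5
# Each product term lists the four bit positions that must all be 0.
TERMS = [frozenset(c) for c in combinations(range(WIDTH), 4)]


def term_true(term: frozenset, state: int) -> bool:
    """A product term is 1 when every bit it names is 0 in `state`."""
    return all(((state >> bit) & 1) == 0 for bit in term)


def state_a_sop(state: int) -> bool:
    """Sum of products: 1 for StateA (00000) and its five companions."""
    return any(term_true(term, state) for term in TERMS)


def covered_by_one_term(before: int, after: int) -> bool:
    """No static hazard if a single term is 1 both before and after."""
    return any(term_true(t, before) and term_true(t, after) for t in TERMS)


recognized = [s for s in range(2 ** WIDTH) if state_a_sop(s)]
hazard_free = all(covered_by_one_term(0b00000, 1 << bit) for bit in range(WIDTH))
print([format(s, "05b") for s in recognized], hazard_free)
```

For every possible single-bit upset out of “00000”, the product term that omits the flipped bit stays at 1 throughout the change, so the SOP output has no reason to glitch. This holds only if the synthesized logic retains all five product terms.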
7 CONCLUSIONS
The usage of state machines in order to efficiently control mechanisms within a design
has become very popular. This paper proposes methods of Fault Tolerant State Machine
implementation due to potential IC SEU susceptibility. Before choosing any particular
technique, the designer should first be aware of the probability of electrical faults and
then decide the appropriate level of coverage necessary to achieve a robust design.
Special directives must be used in order to drive the synthesis tools when implementing
fault tolerant redundant logic because the tools are generally focused on area and speed
optimization. Thus, once the gates are produced, the designer should check that no
functionality has been “optimized” away and that the appropriate state machine has been
realized.