Evaluation of Computer Faults Due to EM Interference

Concepts, Simulation Environment and Some
Results
Shantanu Dutt
(Student Involved: Hasan Arslan)
ECE Dept.
University of Illinois -Chicago
Outline





Past Work: General Fault Detection and Tolerance, EMI Faults
Our Goals
Our Methodology for Fault Detection and Classification
Experimental Results
Conclusions and Future Work
Past Work: General Fault Detection and Tolerance


Off-line testing (mainly for hard faults)
Concurrent on-line testing (operational faults):
  Adding external hardware to monitor data, address and control lines
  Memory: error-detecting and error-correcting codes
  Computer systems:
    Watchdog processor: detects control flow errors in program execution [Mahmood & McCluskey, TC’88]
    Algorithm-based fault tolerance: uses some property of the computation for self-checking [Huang & Abraham, TC’84; Dutt & Assad, TC’96]
Past Work on EM/Radiation-Induced Faults



Detection of high-level computer failure due to different types of EM signals [Mojert et al., EMC’01]
Failure in real-time communication & control systems from communication line errors due to EM signals [Kohlberg & Carter, EMC’01]
Also: radiation-hardened processors: Leon and ERC32 processors (http sites). These provide primarily only ECC for memory and the register file: simple fault tolerance, but probably targeting the most likely source of “permanent” faults.
Assumptions/Scenarios of Past Work

Past Work on general fault detection:




Random single (sometimes double) faults
Deterministic faults
Types of faults: permanent, transient, intermittent;
intermittent type not generally tackled
Past Work on EM-induced faults:

No how/why/what analysis and classification of
computer failure due to EM interference
Goals of Our Work

Will determine and classify the following types of computer system behavioral errors (i.e., program errors) due to different patterns, extent, duration and location of faults:







Control flow errors: incorrect sequence of instruction execution. Causes: address generation errors, memory faults, bus faults
Data errors. Causes: computation errors, memory & bus faults
Termination errors (hung processor & crashes). Causes: control unit (C.U.) transition to dead-end states, invalid instruction, out-of-bound address, divide-by-zero, spurious interrupts (?)
Note: Error types are NOT mutually exclusive
Provide broad-based recipes for FT and reliable operation
To the best of our knowledge, this is a more comprehensive analysis of fault effects on a computer system than attempted previously
Comprehensive analysis is needed due to the nature of EM effects: all-pervasive, periodic, clustered
Our System of Fault Analysis in a Computer System
[Block diagram: Fault Injection; Observation & Error Classification; Detection & Classification using CFC & ABFT/Data Encoding]
Use VHDL model of a modern microprocessor: DLX & SuperScalar DLX
Computer system = Processor + Memory + Ext. Buses
In each component: control of fault duration, frequency, number, and pattern (random, clustered)
Characteristics of Fault Injection Methods: Previous Work

                    Hardware,      Hardware,          Software,            Software,
                    with contact   without contact    compilation          runtime
Cost                High           High               Low                  Low
Damage              High           Low                None                 None
Trigger             Yes            No                 Yes                  Yes
Repeatability       High           Low                High                 High
Controllability     High           Low                High                 High
Acc. FIP            Chip pin       Chip int.          Reg., Mem., Soft.    Reg., Mem., I/O cont./port
Our Fault Injection Approach
•Inject faults in a “software” model (VHDL) of a computer: advantages of both the h/w and s/w approaches without the disadvantages
•Fault types (stuck-at 0, stuck-at 1, single random, clustered, multiple random, etc.)
•Variable duration and frequency of faults
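[Figure: fault generator. Labeled blocks: Counter_1 and Counter_2 (variable-width, variable-period pulse generator), a MUX with inputs 1/0 and data, and fault generators on the data bus and address bus between the DLX CPU and memory.]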
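The fault generator described above can be sketched behaviorally in Python. This is a minimal sketch and not the VHDL model: the function names, the bit-flip (XOR) fault model, and the example word, mask and timing values are all assumptions made for illustration.

```python
import random

def pulse_active(t_ns, period_ns, width_ns):
    """True while the fault pulse is high: the first width_ns of each period."""
    return (t_ns % period_ns) < width_ns

def inject(word, t_ns, period_ns, width_ns, mask):
    """MUX model: pass the bus word through, or flip the masked bits
    while the pulse is high (assumed bit-flip fault model)."""
    return word ^ mask if pulse_active(t_ns, period_ns, width_ns) else word

def stuck_at(word, mask, value):
    """Stuck-at-1 forces the masked bits to 1, stuck-at-0 forces them to 0."""
    return word | mask if value else word & ~mask

def random_mask(n_faults, width=32, rng=random):
    """n_faults distinct random bit positions ('multiple random' pattern)."""
    mask = 0
    for bit in rng.sample(range(width), n_faults):
        mask |= 1 << bit
    return mask

def clustered_mask(n_faults, start, width=32):
    """n_faults adjacent bit positions beginning at 'start' ('clustered' pattern)."""
    return (((1 << n_faults) - 1) << start) & ((1 << width) - 1)

# Hypothetical example: a 4-bit cluster on a 32-bit bus word,
# fault period 305 ns, fault duration 25 ns.
word = 0x00A62820
mask = clustered_mask(4, start=8)
during_pulse = inject(word, t_ns=10, period_ns=305, width_ns=25, mask=mask)
outside_pulse = inject(word, t_ns=100, period_ns=305, width_ns=25, mask=mask)
```

Varying `period_ns` and `width_ns` plays the role of the variable-period, variable-width pulse generator, and the choice of mask function selects the fault pattern.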
Methodologies for Control Flow Errors [Mahmood & McCluskey, TC’88]

A node is a block of instructions with a branch at the end
A derived signature of a node is a function (e.g., or, LFSR) of all its instructions
A program graph is one in which there is an arc from node u to v if the branch at u can lead to node v
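[Figure: example program graph with nodes MAIN and v1 to v6. Signatures are embedded in the instruction stream as NOPs (NOP sign(v4), NOP sign(v5), NOP sign(v6)) next to instructions such as ADD r1 r3, LD r2 address and BLT r4 r8 off. The watchdog (WD) sits on the memory bus between the processor and the memory hierarchy and receives a signal from the branch circuit (Sign(v4), BRT v6 v5).]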
Methodologies for Control Flow Errors: CFC Checking Using a Watchdog

WD compares the information gathered concurrently with the information previously provided
Complexity lies between that of current circuit-level and system-level techniques
90% error coverage for single errors [Mahmood et al., IEEE TC’88]
[Figure: watchdog operation on the example block (block header NOP sign(v4); ADD r1 r3; LD r2 address; NOP sign(v5); NOP sign(v6); BLT r4 r8 off) within the program graph MAIN, v1 to v6. WD state machine: START; wait for new block; compute block signature; check block signature; check branch signature (Sign(v4), BRT v6 v5); signature ok or error flag.]
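The block-signature and branch checks can be illustrated with a small Python model. It is a sketch under stated assumptions: the signature is taken to be the XOR of the instruction words (the slides only say "a function" of the instructions), and the node names, instruction words and graph below are hypothetical.

```python
from functools import reduce

def signature(instr_words):
    """Derived signature of a block: here the XOR of its instruction words
    (an assumed choice; an LFSR could be used instead)."""
    return reduce(lambda a, b: a ^ b, instr_words, 0)

class Watchdog:
    def __init__(self, ref_sigs, graph):
        self.ref_sigs = ref_sigs   # node -> signature computed ahead of time
        self.graph = graph         # node -> set of legal successor nodes
        self.prev = None           # last node that was entered legally

    def check_block(self, node, fetched_words):
        """Return None if the block passes both checks, else the error kind."""
        if self.prev is not None and node not in self.graph[self.prev]:
            return "branch error"
        self.prev = node
        if signature(fetched_words) != self.ref_sigs[node]:
            return "block signature error"
        return None

# Hypothetical program: v4 branches to v5 or v6.
blocks = {
    "v4": [0x00A62820, 0x8C820000],
    "v5": [0x20630001],
    "v6": [0xAC640000],
}
graph = {"v4": {"v5", "v6"}, "v5": set(), "v6": set()}
ref_sigs = {node: signature(words) for node, words in blocks.items()}

wd = Watchdog(ref_sigs, graph)
ok = wd.check_block("v4", blocks["v4"])                       # fault-free fetch
corrupted = wd.check_block("v5", [blocks["v5"][0] ^ 0x4000])  # one flipped bit
illegal = wd.check_block("v4", blocks["v4"])                  # v5 -> v4 is not an arc
```

With an XOR signature, an even number of flips in the same bit position across a block cancels out, which is one reason signature aliasing needs separate analysis.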
Data Error Methodologies: Algorithm-Based Fault Tolerance

Data errors are difficult to detect: they occur inside the microprocessor and are not necessarily observable by an external WD processor
Use properties of the computation to check correctness of computed data
E.g., the linearity property f(v1+v2) = f(v1) + f(v2) of a computation f( ) can be used to check it:

Pre-compute v’ = v1 + v2 + … + vk (input checksum)
Compute f(v1), …, f(vk)
Compute u = f(v1) + f(v2) + … + f(vk) (output checksum)
Check if f(v’) = u; inequality indicates computation error(s)
Can be used for linear computations such as matrix multiplication, matrix addition, Gaussian elimination [Huang & Abraham, TC’84], [Dutt & Assad, TC’96]
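The checksum steps can be sketched in Python for one linear computation, an integer matrix-vector product. The matrix, the input vectors and the injected error are hypothetical, and the sketch assumes the checksum evaluation f(v’) itself is fault-free.

```python
def matvec(M, v):
    """Linear computation f(v) = M * v over the integers."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def abft_check(f, inputs, outputs):
    """True if f(v1 + ... + vk) equals f(v1) + ... + f(vk)."""
    v_sum = inputs[0]
    for v in inputs[1:]:
        v_sum = vec_add(v_sum, v)      # input checksum v'
    u = outputs[0]
    for out in outputs[1:]:
        u = vec_add(u, out)            # output checksum u
    return f(v_sum) == u

M = [[1, 2], [3, 4]]
f = lambda v: matvec(M, v)
inputs = [[1, 0], [0, 1], [2, 3]]

outputs = [f(v) for v in inputs]
clean = abft_check(f, inputs, outputs)

outputs[1][0] ^= 1 << 3                # inject a single-bit data error
faulty = abft_check(f, inputs, outputs)
```

The check costs one extra evaluation of f plus the two checksum sums, independent of where inside the computation the error occurred.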
Data Error Methodologies: Data Encoding

Data that is numerically processed can be encoded, and the output of arithmetic operations checked to see if it is still encoded (e.g., Berger, AN codes)
A simple coding scheme is AN coding: a number N is transformed to A·N, where A is odd, say, 3
Works for addition: 3·N1 + 3·N2 = 3(N1+N2); check if the result is still a multiple of 3; if not, then there is an error
100% detection of single faults: a single fault changes the result by +/- (2^i), which is never a multiple of 3, so the result is no longer a multiple of 3.
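A minimal Python sketch of AN coding with A = 3 follows. The helper names and the operand values are illustrative assumptions; only the multiple-of-3 check comes from the slide.

```python
A = 3

def encode(n):
    """AN code word for n."""
    return A * n

def is_valid(code):
    """A code word must remain a multiple of A."""
    return code % A == 0

def decode(code):
    if not is_valid(code):
        raise ValueError("AN check failed: result is not a multiple of %d" % A)
    return code // A

# Addition preserves the code: 3*N1 + 3*N2 = 3*(N1 + N2).
total = encode(25) + encode(17)
recovered = decode(total)

# Flipping any single bit changes the value by +/- 2^i, which is never a
# multiple of 3, so every single-bit fault in the result is detected.
all_single_flips_detected = all(not is_valid(total ^ (1 << i)) for i in range(32))
```

Multi-bit faults are a different matter: a change of, say, 2^1 + 2^0 = 3 keeps the result a multiple of 3 and goes undetected.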
Methodologies for Termination Errors

Valid address range registers R_low, R_high in the processor: check each generated address to see if it is in range

Can detect crashes due to invalid addresses
Timeout mechanisms: store an upper bound on execution time for each block in the watchdog; if the time is exceeded at run time, flag an error

Can detect infinite loops or a hung processor due to control unit faults
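Both termination checks amount to simple comparisons, as in the Python sketch below. The class name, the address limits and the per-block time bound are hypothetical values chosen for illustration.

```python
class TerminationMonitor:
    def __init__(self, r_low, r_high, time_bound_ns):
        self.r_low = r_low                   # lowest valid address
        self.r_high = r_high                 # highest valid address
        self.time_bound_ns = time_bound_ns   # block -> upper bound on exec. time

    def address_ok(self, addr):
        """Range check applied to every generated address."""
        return self.r_low <= addr <= self.r_high

    def time_ok(self, block, elapsed_ns):
        """Timeout check: flags infinite loops or a hung processor."""
        return elapsed_ns <= self.time_bound_ns[block]

# Hypothetical limits: a 1 KB address space and one block bounded at 220 ns.
mon = TerminationMonitor(r_low=0x000, r_high=0x3FF, time_bound_ns={"v4": 220})
in_range = mon.address_ok(0x114)
out_of_range = mon.address_ok(0x80000410)
timed_out = not mon.time_ok("v4", elapsed_ns=5000)
```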
Current Implementation






Fault injection with various controls (duration, frequency, extent, pattern) for a non-pipelined DLX processor in VHDL
Fault injection on memory data/address buses
Description of a watchdog processor in VHDL for control flow checking + infinite-loop termination errors
Valid address range registers in processor
ECC (1-error correction and 2-error detection) of memory (commercial feature) and buses (nonstandard)
Some error analysis results for a simple Fibonacci computation: f(i) = f(i-1) + f(i-2), i=2 to 99, f(0)=f(1)=0
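Current Implementation: ECC Capabilities on Memory and Buses
[Figure: ECC on memory and buses. Encoder (En) and decoder (Dec) pairs at the CPU and memory ends of the bus, with the fault injector between them; address path with PC, L. Adr. Reg. and H. Adr. Reg.; word width 32 + 7 ECC = 39 bits. Values shown: Add r30,r0,r14 (000ef020), 4181ee8008, 4380ee8818, rfe (410ee030).]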
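A 32 + 7 = 39-bit single-error-correcting, double-error-detecting code can be built as an extended Hamming code, sketched below in Python. The bit layout is an assumption (the slides do not give the actual code), so the encoded values will not match those shown in the figure.

```python
PARITY_POS = [1, 2, 4, 8, 16, 32]                            # 6 Hamming check bits
DATA_POS = [p for p in range(1, 39) if p not in PARITY_POS]  # 32 data positions

def encode(data):
    """Return 39 bits: index 0 is overall parity, 1..38 the Hamming code word."""
    bits = [0] * 39
    for i, p in enumerate(DATA_POS):
        bits[p] = (data >> i) & 1
    for pp in PARITY_POS:
        # Make the parity of all positions whose index contains bit pp even.
        bits[pp] = sum(bits[p] for p in range(1, 39) if p & pp) % 2
    bits[0] = sum(bits[1:]) % 2          # overall parity over the code word
    return bits

def decode(bits):
    """Return (data, status); data is None if the word cannot be corrected."""
    bits = list(bits)
    syndrome = 0
    for p in range(1, 39):
        if bits[p]:
            syndrome ^= p
    overall = sum(bits) % 2
    if overall == 0 and syndrome != 0:
        return None, "double error detected"
    status = "ok"
    if overall == 1:
        if syndrome > 38:
            return None, "uncorrectable"
        bits[syndrome] ^= 1              # syndrome 0: the overall parity bit itself
        status = "corrected"
    data = sum(bits[p] << i for i, p in enumerate(DATA_POS))
    return data, status

word = 0x000EF020                        # Add r30,r0,r14 from the figure
code = encode(word)
single = list(code); single[9] ^= 1
double = list(code); double[3] ^= 1; double[17] ^= 1
```

Here `decode(single)` recovers the original word and `decode(double)` reports a detected double error. Faults of three or more bits, such as the 4-bit faults injected in the experiments, can be miscorrected by a code of this kind.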
Some Error Observations

Address    Orig. instruction    Corrupted instruction   Observation
00000040   ADDI R7, R0, 3       ADDI R7, R0, 3
00000044   SW   0(R3), R7       SW   0(R2), R23         Wrong initialization
.          .                    .
000000A4   LW   R3, -12(R30)    LW   R3, -12(R30)
000000A8   LHI  R4, 0           LHI  R4, 0
.          .                    .                       Invalid addr. err.
000000D4   LW   R6, -12(R30)    LW   R6, -12(R30)
000000D8   SLLI R6, R6, 2       SLLI R14, R4, 2         Ind. addr. err.
000000DC   ADD  R5, R5, R6      ADD  R5, R5, R6
000000E0   LW   R4, 0(R4)       LW   R4, 0(R4)
000000E4   LW   R5, 0(R5)       LW   R5, 0(R5)
000000E8   ADD  R4, R4, R5      ADD  R4, R4, R5
000000EC   SW   0(R3), R4       SW   0(R3), R4
.          .                    .                       Inc. unknown value as index value
0000010C   LW   R3, -12(R30)    LW   R3, 1040(R14)
00000110   ADDI R3, R3, 1       ADDI R3, R3, 1
00000114   SW   -12(R30), R3    SW   -12(R30), R3
Some Error Observations (contd.)
[Figure: TRAP instruction format, with the TRAP opcode in bits 0-5 and the trap_id field in bits 6-31; Trap_0 code = 44000000. 2-bit faults turn Slli r5,r5,#4 (50c60004) and Add r4,r4,r5 (00a62820) into Trap_0 (44XXXXX4).]
•DLX uses TRAP_0 to stop execution. The processor checks the first 6 bits (0-5) for the trap instruction and the last bit (31) for the trap_id. No other bit is checked.
•For all ALU instructions, the first 6 bits are always 0. When bits 1 and 5 are set (opcode 010001, as in the Trap_0 code 44000000), they become trap instructions. Hence their distance is 2.
•For trap instructions, if the last bit is 0, execution stops (Trap_0). Unfortunately, for most ALU instructions (add, and, xor, rfe, etc.), the last bit is also 0.
•DLX interprets the last 5 bits (27-31) as the trap_id (bits 6-26 are ignored). Non-trap instructions interpret bits 6-10 as a src./dst. register.
•The check for trap/non-trap instructions was extended to bits 6-10, to increase the minimum distance from 2 to 3.
•Premature stops due to Trap_0 are thus reduced.
•More refined schemes to increase the minimum distance: ongoing work
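The distance-2 problem can be reproduced with a few lines of Python. This is a simplified model of the original (unfixed) decode described above, using the instruction words from the slide; the helper names are assumptions, and bit 0 is the most significant bit, as in the slide's numbering.

```python
TRAP_OPCODE = 0x44000000 >> 26           # 0b010001, from the Trap_0 code

def opcode(word):
    """Bits 0-5 of the slide's numbering, i.e. the top 6 bits of the word."""
    return (word >> 26) & 0x3F

def is_trap0(word):
    """Original decode: trap opcode in bits 0-5 and last bit (31) equal to 0."""
    return opcode(word) == TRAP_OPCODE and (word & 1) == 0

def hamming(a, b):
    return bin(a ^ b).count("1")

add_word = 0x00A62820                    # ALU instruction, opcode 000000
slli_word = 0x50C60004                   # shift instruction, opcode 010100

dist_add = hamming(opcode(add_word), TRAP_OPCODE)
dist_slli = hamming(opcode(slli_word), TRAP_OPCODE)

# Two bit flips in the opcode field are enough to stop the program.
faulty_add = add_word ^ 0x44000000
faulty_slli = slli_word ^ 0x14000000
```

Both faulty words satisfy `is_trap0`, which is the premature-stop case the extended check on bits 6-10 is meant to make less likely.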
Experimental Setup: Fault Injection Parameters
• 4 random errors simulated on the data bus with the following characteristics:
Clock cycle: 22 ns
Repeat period: 10 ns - 800 ns (f = 100 MHz - 1.25 MHz)
Duration range: 5 ns - 400 ns
R=20, D=5

                             Low_Low      Med_Med      High_High
Repeat period range (ns)     305 - 425    160 - 440    305 - 425
Duration range (ns)          5 - 25       180 - 220    300 - 400
Experimental Results
Configurations: No Addr. Corr. Trap_0; No Addr. Corr. Trap_Fixed; Addr. Corr. Trap_0; Addr. Corr. Trap_Fixed
[Charts: average executed instructions for each simulation (1134 instructions with no fault); average executed instructions for each run stopped by a trap]
When the trap is fixed, fewer runs are terminated by a trap, but invalid-address termination (IAT) errors increase.
Average instructions executed when the simulation stopped because of IAT: 35.42 for the first type, 38.37 for the second type, 88.46 for the third type, 124.54 for the fourth type
Experimental Results (Cont.)
Simulation Times, Data Computation
Configurations: No Addr. Corr. Trap_0; No Addr. Corr. Trap_Fixed; Addr. Corr. Trap_0; Addr. Corr. Trap_Fixed
[Charts: average execution time for each simulation (265,620 ns for the non-faulty run); average number of array element updates]
The longer a simulation runs, the more data elements it calculates
Experimental Results
52 simulations for Low_Low
Configurations: T_0 = No Addr. Corr., Trap_0; T_F = No Addr. Corr., Trap_Fixed; A_0 = Addr. Corr., Trap_0; A_F = Addr. Corr., Trap_Fixed

Config  Errors  Error density  Term. err.    Trap / Inv. addr.  Cont. flow    Data err.
T_0     410     7/100          43 (10%)      14 / 29            66 (16%)      301 (74%)
T_F     424     7/100          41 (9.6%)     9 / 32             76 (18%)      307 (73%)
A_0     444     2.2/100        38 (6.8%)     11 / 27            54 (13.6%)    315 (79.5%)
A_F     446     1.7/100        27 (6%)       6 / 21             24 (5.4%)     395 (88.6%)

[Chart: percentage of data, control flow and termination errors (0% to 100%) for T_0, T_F, A_0, A_F]
The longer the program runs, the more data errors it produces.
When the trap is fixed, more simulations complete, but invalid-address terminations increase.
When addresses are corrected, invalid-address errors are reduced and the simulation executes more instructions, which increases data errors.
Experimental Results (cont)
52 simulations for Med_Med

Config  Errors  Error density  Term. err.    Trap / Inv. addr.  Cont. flow    Data err.
T_0     68      15/100         52 (76.5%)    1 / 51             3 (4.4%)      13 (19.1%)
T_F     82      11/100         52 (63.5%)    0 / 52             7 (8.5%)      23 (28%)
A_0     150     10/100         52 (34%)      7 / 43             54 (36%)      44 (30%)
A_F     175     8/100          45 (25.7%)    8 / 37             45 (25.7%)    85 (48.6%)

[Chart: percentage of data, control flow and termination errors (0% to 100%) for T_0, T_F, A_0, A_F]
Increasing the fault injection period reduces the number of executed instructions, so the error density increases sharply.
Experimental Results (cont)
52 simulations for High_High

Config  Errors  Error density  Term. err.    Trap / Inv. addr.  Cont. flow    Data err.
T_0     61      35/100         52 (85%)      1 / 51             9 (15%)       0 (0%)
T_F     90      48/100         52 (57%)      0 / 52             38 (43%)      0 (0%)
A_0     93      22/100         52 (55.3%)    9 / 43             41 (43.6%)    1 (1.1%)
A_F     52      26/100         52 (100%)     4 / 48             0 (0%)        0 (0%)

[Chart: percentage of data, control flow and termination errors (0% to 100%) for T_0, T_F, A_0, A_F]
The processor is never able to calculate a Fibonacci value because of the high fault injection rate. None of the simulations completes.
Error Coverage
T_0: No Addr. Corr. Trap_0
For error coverage, we ran our simulation 122 times with repeat period 300 ns - 500 ns and duration range 150 ns - 250 ns
[Chart: coverage (%) by mechanism. ECC coverage: 95; control flow coverage: 20; data coverage: no value shown]
Total: 434 erroneous instructions executed
411 erroneous instructions covered by ECC (95%)
90 erroneous instructions covered by WD (20%). We are injecting 4-bit faults; if the process jumps into the middle of a block, the WD spends time finding the beginning of a block.
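The coverage percentages are ratios of covered to total erroneous instructions. A short Python check using the T_0 counts from this slide (the function name is an assumption):

```python
def coverage_pct(covered, total):
    """Percentage of erroneous instructions caught by a mechanism."""
    return 100.0 * covered / total

ecc_cov = coverage_pct(411, 434)   # about 94.7, shown as 95% on the slide
wd_cov = coverage_pct(90, 434)     # about 20.7, shown as 20% on the slide
```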
Error Coverage (Cont.)
T_F: No Addr. Corr. Trap_Fixed
For error coverage, we ran our simulation 122 times with repeat period 300 ns - 500 ns and duration range 150 ns - 250 ns
[Chart: coverage (%) by mechanism. ECC coverage: 95; control flow coverage: 23; data coverage: no value shown]
Total: 474 erroneous instructions executed
450 erroneous instructions covered by ECC (95%)
106 erroneous instructions covered by WD (23%). We are injecting 4-bit faults; if the process jumps into the middle of a block, the WD spends time finding the beginning of a block.
Error Coverage (Cont.)
A_0: Addr. Corr. Trap_0
For error coverage, we ran our simulation 122 times with repeat period 300 ns - 500 ns and duration range 150 ns - 250 ns
[Chart: coverage (%) by mechanism. ECC coverage: 83; control flow coverage: 20; data coverage: 13]
Total: 3553 erroneous instructions executed
2959 erroneous instructions covered by ECC (83%)
703 erroneous instructions covered by WD (20%). We are injecting 4-bit faults; if the process jumps into the middle of a block, the WD spends time finding the beginning of a block.
There were 89 data errors; 12 (13%) of them were covered by 3N coding
Error Coverage (Cont.)
A_F: Addr. Corr. Trap_Fixed
For error coverage, we ran our simulation 122 times with repeat period 300 ns - 500 ns and duration range 150 ns - 250 ns
[Chart: coverage (%) by mechanism. ECC coverage: 82; control flow coverage: 18; data coverage: 19]
Total: 4219 erroneous instructions executed
3426 erroneous instructions covered by ECC (82%)
762 erroneous instructions covered by WD (18%). We are injecting 4-bit faults; if the process jumps into the middle of a block, the WD spends time finding the beginning of a block.
There were 106 data errors; 20 (19%) of them were covered by 3N coding
Error Coverage (cont)
Addr. Corr. Trap_0
For error coverage, 20 runs were selected that resulted in complete simulations, with combinations of period 305 - 460 ns and duration range 5 - 60 ns
[Chart: coverage (%) by mechanism. ECC coverage: 80; control flow coverage: 39; data coverage: 23]
Total: 438 erroneous instructions executed
349 erroneous instructions covered by ECC (80%)
170 erroneous instructions covered by WD (39%). We are injecting 4-bit faults; if the process jumps into the middle of a block, the WD spends time finding the beginning of a block.
There were 217 data errors; 51 (23%) of them were covered by 3N coding
Conclusions





Have completed a significant but preliminary fault simulation of the DLX processor in VHDL
Obtained percentages of termination, control and data errors for different fault durations and frequencies
Encoding the TRAP instruction to have a minimum distance from other instructions helps in reducing incorrect termination
Need to have ECC for the register fields of instructions to reduce incorrect address generation and data errors
It seems possible to catch most errors with the combination of mechanisms we have suggested, so at least a fail-safe mode can be guaranteed with high confidence, though there is room for improvement in control & data error detection
Future Work






Other fault patterns (e.g., clusters); correlation with EM-induced fault work by others in our group
Other block signature techniques (e.g., LFSR) with better fault coverage
Aliasing analysis (mathematical, empirical) for signatures
Perform error analysis for more substantial “real-life” programs (scientific computations, non-numeric, system or O.S.)
Fault injection and analysis for SuperScalar DLX
Fault injection and analysis of on-chip processor components (integer and FP ALU, register files, control unit, internal buses, power/ground lines)
Looking Further Ahead





Q: Are there patterns of errors that lead to computer crashes with high probability?
Q: If so, can the detection of such patterns be used to shut down the computer in a fail-safe manner (save state & data for later resumption)?
Q: Are there patterns of errors that are characteristic of EM-induced faults versus random single/double faults?
Q: If so, can these be used as “early detection & warning” of EM interference?
Future: Based on the correlation of system errors to EM faults, determine fault tolerance/error minimization techniques for EM-induced faults.