Evaluation of Processor Faults Due to EM Interference
Concepts and Simulation Environment
Shantanu Dutt, Hasan Arslan
ECE Dept.
University of Illinois at Chicago

Outline
• Past Work – General Fault Detection and Tolerance
• Past Work – EMI-Induced Faults
• Fault Types and Fault Injection Methods
• Proposed Work and System
• Methodologies to Detect Faults
• Questions and Future Outlook

Past Work – General Fault Detection and Tolerance
• Off-line testing of digital circuits
• Self-diagnosis
  • Tests each functional block
  • Not well suited to real-time applications
• Redundancy
  • Hardware, software, or time redundancy
  • Carries a high overhead penalty

Past Work – General Fault Detection and Tolerance
• Concurrent on-line testing
  • Adds external hardware that monitors the data, address, and control lines
  • Memory: error-detecting and error-correcting codes
  • Computer systems
    • Watchdog processor: detects control flow errors in program execution [Mahmood & McCluskey, TC’88]
    • Algorithm-based fault tolerance: uses some property of the computation for self-checking [Huang & Abraham, TC’84], [Dutt & Assad, TC’96]

Past Work – General Fault Detection and Tolerance (contd.)
• Concurrent on-line testing (contd.)
  • Reconfigurable systems: on-line testing and fault tolerance using dynamic circuit reconfiguration
  • FPGA-based systems: on-line testing and fault tolerance [Verma, M.S. Thesis, UIC’01], [Dutt et al., ICCAD’99], [Mahapatra & Dutt, FTCS’99], [Abramovici et al., ITC’99]

EM-Induced Faults
• Detection of high-level computer failures caused by different types of EM signals [Mojert et al., EMC’01]
  • Radiation therapy machine overdoses patients
  • Space Shuttle cannot launch because of a synchronization error in its redundant computers
  • Failures in real-time communication and control systems caused by communication line errors due to EM signals [Kohlberg & Carter, EMC’01]
• SEUs (single-event upsets): a potential threat to the reliability of integrated circuits operating in a radiation environment
  • Space and avionics applications (high-energy particles)
    • Hubble Space Telescope, NASA's space-based astronomical observatory
  • Ground level (atmospheric neutrons)

Fault Types & Fault Injection Methods
Error Types
• Control flow errors: incorrect sequence of instruction execution. Causes: address generation errors, memory faults, bus faults
• Data errors. Causes: computation errors, memory and bus faults
• Hung processor and crashes. Causes: control unit (C.U.) transitions to dead-end states, invalid instructions, out-of-bound addresses, divide-by-zero
• Error types are NOT mutually exclusive

Fault Types & Fault Injection Methods
Fault Injection Methods
• Hardware fault injection
  • With contact (voltage or current changes, using pin-level probes and sockets): Messaline [Arlat et al., FTC’89]
  • Without contact (heavy-ion radiation and EMI): FIST [Gunneflo et al., FTC’89], MARS [Karlsson et al., DCCA’95]
• Software fault injection
  • Compile-time injection (modifying program instructions): Doctor [Han et al., CPD’95]
  • Runtime injection (a trigger activates the fault injection mechanism)
    • Time-out
    • Exception/trap: Xception [Carreira et al., DCCA’95]
    • Code insertion
    • Other tools: Ferrari [Kanawati et al., FTC’92], Ftape [Tsai et al., FTC’96]

Fault Types & Fault Injection Methods
Software Fault Injection (contd.)
• Advantages
  • Does not require expensive hardware
  • Can target applications and operating systems, which is difficult to do with hardware fault injection
• Disadvantages
  • Changes the structure of the original software
  • Cannot inject faults into locations that are inaccessible to software

Fault Types & Fault Injection Methods
[Figure: fault injection system connected to the target system. Blocks: controller, fault library, fault injector, workload library, workload generator, monitor, data collector, data analyzer.]

Characteristics of Fault Injection Methods

                      Hardware                         Software
                      With contact   Without contact   Compilation          Runtime
Cost                  High           High              Low                  Low
Damage                High           Low               None                 None
Trigger               Yes            No                Yes                  Yes
Repeatability         High           Low               High                 High
Controllability       High           Low               High                 High
Acc. FIP              Chip pin       Chip int.         Reg., mem., soft.    Reg., mem., I/O cont./port

Acc. FIP: accessible fault injection points.

Proposed Work
• VHDL modeling of a modern microprocessor (using an available VHDL description of the DLX microprocessor, with appropriate modifications)
• VHDL-based introduction of fault injection logic in the CPU, memory, and external buses to simulate the fault patterns likely to be caused by EMI
• Development of techniques for detecting program errors due to these faults
• Classification of the fault types into data, control, and hung/crashed processor
• Preliminary results for simulation of faults in the external memory address and data buses

Proposed Work
• Location and values of faults
• Fault types (stuck-at-0, stuck-at-1, single random, clustered, multiple random, etc.)
• Duration of faults: [0, 50T], where T = CPU clock cycle
• Start times: [0, Texc(workload)], where Texc = execution time without faults
[Figure: fault generator attached to the data bus and address bus between the DLX CPU and memory. Labels: signal line, data, Counter_1, Counter_2, 1/0 select, variable-width and variable-period pulse generator.]

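The fault parameters on this slide (type, location, start time, duration) can be illustrated in software. The following is only a minimal Python sketch, not the VHDL fault generator of the proposed system: the function names `inject_fault` and `faulty_bus_trace` are hypothetical, and it assumes one bus word is transferred per clock cycle.

```python
import random

def inject_fault(word, fault_type, bit, width=32, rng=None):
    """Return the value seen on the bus when `word` passes a faulty line.

    `bit` is the affected bus line for stuck-at faults; for a single random
    fault the flipped bit is chosen at random and `bit` is ignored.
    """
    full = (1 << width) - 1
    mask = 1 << bit
    if fault_type == "stuck_at_0":
        return word & ~mask & full
    if fault_type == "stuck_at_1":
        return (word | mask) & full
    if fault_type == "single_random":
        rng = rng or random.Random()
        return word ^ (1 << rng.randrange(width))
    raise ValueError(f"unknown fault type: {fault_type}")

def faulty_bus_trace(words, fault_type, bit, start, duration, width=32, rng=None):
    """Apply the fault to words transferred in cycles [start, start + duration)."""
    trace = []
    for cycle, word in enumerate(words):
        if start <= cycle < start + duration:
            word = inject_fault(word, fault_type, bit, width, rng)
        trace.append(word)
    return trace

if __name__ == "__main__":
    # Line 0 of an 8-bit bus is stuck at 0 for two cycles starting at cycle 1.
    print([hex(w) for w in faulty_bus_trace([0xFF] * 5, "stuck_at_0", 0, 1, 2, width=8)])
```

Clustered and multiple random faults could be added as further `fault_type` cases; they are left out here to keep the sketch short.
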
Proposed Work (contd.)
• Will include similar fault-injection capability for on-chip wires, with a probabilistic component based on the analysis of EM effects on power/ground (p/g) lines from the circuit analysis component
• The processor will be partitioned into 4 main modules (control unit, ALU, register file, and cache), with separate or common p/g lines, to determine different degrees of susceptibility
[Figure: processor partitioned into control unit, cache, register file, and ALU, each shown with p/g lines.]

Methodologies: Control Flow Checking
• A watchdog is a small co-processor that monitors the behavior of the system
• It is provided in advance with information about the processor to be checked (memory accesses, control flow, control signals, ...)
• It compares the information gathered concurrently with the information previously provided
• Its complexity lies between that of current circuit-level and system-level techniques
[Figure: watchdog attached to the memory bus between the processor and the memory hierarchy, with a signal from the branch circuit.]

Methodologies: Control Flow Checking

_fibo:
      sw   -4(r14),r30
      .
      .
      seq  r1,r3,r4
      bnez r1,L3
      .
      .
      seq  r1,r3,r4
      bnez r1,L3
      j    L2
L3:
      .
      addi r1,r0,#1
      j    L1
L2:
      ..
      ..

[Figure: program graph of the code above, monitored by the watchdog (WD). Labels: nodes n1 to n5, L1, Sign(n4), BRT L1.]

• A node is a block of instructions with a branch at the end
• The derived signature of a node is a function (e.g., XOR, LFSR) of all its instructions
• A program graph has an arc from node u to node v if the branch at u can lead to node v
• With signature computation, error coverage is high (>90%) even with multiple faults [Mahmood & McCluskey, FTCS’85]

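The node, signature, and program graph definitions on this slide can be made concrete with a small example. This is a hedged Python sketch rather than the watchdog design from the cited work: the class `Watchdog`, the function `node_signature`, and the instruction words are all hypothetical, and XOR is used as the signature function because the slide names it as one option.

```python
from functools import reduce

def node_signature(instr_words):
    """Derived signature of a node: XOR of all its instruction words."""
    return reduce(lambda a, b: a ^ b, instr_words, 0)

class Watchdog:
    """Checks each executed node against precomputed signatures and arcs."""

    def __init__(self, signatures, arcs):
        self.signatures = signatures  # node name -> precomputed signature
        self.arcs = arcs              # node name -> set of legal successor nodes
        self.prev = None              # last node seen, None before the first one

    def check(self, node, fetched_words):
        """Return True if the transition is legal and the signature matches."""
        legal = self.prev is None or node in self.arcs.get(self.prev, set())
        matches = node_signature(fetched_words) == self.signatures.get(node)
        self.prev = node
        return legal and matches

if __name__ == "__main__":
    # Made-up instruction words for three nodes of a program graph.
    program = {"n1": [0x11, 0x22], "n2": [0x33], "n3": [0x44, 0x55]}
    arcs = {"n1": {"n2", "n3"}, "n2": {"n3"}}
    sigs = {name: node_signature(words) for name, words in program.items()}
    wd = Watchdog(sigs, arcs)
    print(wd.check("n1", [0x11, 0x22]))  # fault-free node
    print(wd.check("n3", [0x44, 0x54]))  # one bit flipped in an instruction
```

A plain XOR signature misses any pair of errors that flip the same bit position within a node; an LFSR-based signature, the other option named on the slide, reduces that aliasing.
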
Examples of Error Types

L1:
      .
      lw   r3,0(r30)
      addi r0,r0,#1
      seq  r1,r3,r0
      bnez r1,L1
L2:
      .
      .
      subi r2,r2,#1
      seq  r1,r3,r2
      bnez r1,L2
      j    L4
L3:
      .
      addi r1,r0,#1
      j    L1
L4:
      ..
      ..

Error types
• Segmentation fault: r0=24, r3=25
• Hung processor: r2=1, r3=0
• Out-of-bound address: L4=256
• Invalid instruction: the instruction code can be changed

Analysis of Error
[Figure: percentage of errors as a function of fault duration. x-axis: fault duration (10 to 50); y-axis: % of error (0 to 60).]
[Figure: percentage of errors as a function of fault injection frequency. x-axis: fault injection frequency (%), 1 to 5; y-axis: % of error (0 to 80).]

Analysis of Error
• Program never finished (47%)
• Program terminated incorrectly (23%)
• Terminated with incorrect result (23%)
• Terminated with correct result (7%)

Methodologies: Algorithm-Based Fault Tolerance
• Instruction execution errors
  • Difficult to detect: they occur inside the microprocessor and are not observable by an external watchdog processor
  • Off-line scheme for detecting execution errors due to permanent faults [K.K. Saluja et al., IEEE ITC 1983]
  • Transient faults occur more frequently than permanent faults in digital systems
  • Transient faults must be detected in real time

Methodologies: Algorithm-Based Fault Tolerance
• Use properties of the computation to check the correctness of computed data
• E.g., the linearity property f(v1 + v2) = f(v1) + f(v2) of a computation f() can be used to check it:
  • Pre-compute v’ = v1 + v2 + … + vk (input checksum)
  • Compute f(v1), …, f(vk)
  • Compute u = f(v1) + f(v2) + … + f(vk) (output checksum)
  • Check whether f(v’) = u; inequality indicates computation error(s)
  • Applicable to linear computations such as matrix multiplication, matrix addition, and Gaussian elimination [Huang & Abraham, TC’84], [Dutt & Assad, TC’96]

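The checksum test on this slide can be tried on a small linear computation. The sketch below is an illustration under assumptions, not the scheme from the cited papers: f is taken to be multiplication by a fixed integer matrix, the helper names `matvec`, `vec_add`, and `abft_check` are hypothetical, and integers are used so that the equality test is exact (floating-point data would need a tolerance).

```python
def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def vec_add(x, y):
    """Element-wise sum of two vectors of equal length."""
    return [a + b for a, b in zip(x, y)]

def abft_check(f, inputs, outputs):
    """Return True if f(input checksum) equals the output checksum."""
    v_sum = inputs[0]
    for v in inputs[1:]:
        v_sum = vec_add(v_sum, v)   # v' = v1 + v2 + ... + vk
    u = outputs[0]
    for out in outputs[1:]:
        u = vec_add(u, out)         # u = f(v1) + f(v2) + ... + f(vk)
    return f(v_sum) == u

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    f = lambda v: matvec(A, v)
    inputs = [[1, 0], [0, 1], [2, 3]]
    outputs = [f(v) for v in inputs]
    print(abft_check(f, inputs, outputs))  # fault-free run
    outputs[1][0] ^= 1                     # simulate a single-bit data error
    print(abft_check(f, inputs, outputs))  # checksum mismatch
```

The check costs one extra evaluation of f plus the checksum additions. It assumes the evaluation of f(v’) is itself fault-free, and errors that cancel in the output sum go undetected.
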
Methodologies: Algorithm-Based Fault Tolerance
• Use a watchdog to monitor the bus and fetch the instruction opcodes along with the main processor
  • Calculate the expected execution parameters of each instruction
  • Store this information in the watchdog processor (instruction parameter table)
  • Compare the parameters of each fetched instruction with the stored data
  • If the parameters do not match, report an error
  • Error coverage depends on the program and the microprocessor; for the 8086 instruction set, it is around 85% for single-bit errors [Khan & Tront, IEEE TC, 1989]

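The table lookup described on this slide can be sketched as follows. This is only an illustration: the slide does not say which execution parameters are stored, so the sketch assumes two hypothetical ones (instruction length in bytes and number of data bus accesses), and the mnemonics and values in `PARAM_TABLE` are made up rather than taken from the 8086 instruction set.

```python
# Hypothetical instruction parameter table:
# opcode -> (instruction length in bytes, number of data bus accesses)
PARAM_TABLE = {
    "LOAD": (2, 1),
    "ADD": (1, 0),
    "STORE": (2, 1),
}

def check_instruction(opcode, observed):
    """Compare observed execution parameters with the stored ones.

    Returns None when they match, otherwise a short error message.
    """
    expected = PARAM_TABLE.get(opcode)
    if expected is None:
        return "invalid opcode"
    if observed != expected:
        return "parameter mismatch"
    return None

if __name__ == "__main__":
    print(check_instruction("ADD", (1, 0)))   # matches the table
    print(check_instruction("LOAD", (2, 0)))  # missing bus access
    print(check_instruction("XYZ", (1, 0)))   # opcode not in the table
```

A fault that turns one valid opcode into another with identical parameters passes this check, which is one reason coverage stays below 100%.
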
Goals, Questions & Future Outlook
• Q: Are there patterns of errors that lead to computer crashes with high probability?
• Q: If so, can the detection of such patterns be used to shut down the computer in a fail-safe manner (saving state and data for later resumption)?
• Q: Are there patterns of errors that are characteristic of EM-induced faults, as opposed to random single/double faults?
• Q: If so, can these be used for “early detection and warning” of EM interference?
• Future: based on the correlation of system errors with EM faults, determine fault tolerance and error minimization techniques for EM-induced faults