Evaluation of Processor Faults Due to EM Interference
Concepts and Simulation Environment
Shantanu Dutt, Hasan Arslan
ECE Dept., University of Illinois-Chicago

Outline
• Past Work – General Fault Detection and Tolerance
• Past Work – EMI-Induced Faults
• Fault Types and Fault Injection Methods
• Proposed Work and System
• Methodologies to Detect Faults
• Questions and Future Outlook

Past Work – General Fault Detection and Tolerance
• Off-line testing of digital circuits
  - Self-diagnosis
  - Test each functional block
  - Not a good approach for real-time applications
• Redundancy
  - Hardware, software or time
  - Has a high overhead penalty

Past Work – General Fault Detection and Tolerance
• Concurrent on-line testing: adding external hardware, monitoring data, address and control lines
  - Memory: error-detecting & correcting codes
  - Computer systems
    - Watchdog processor: detecting control flow errors in program execution [Mahmood & McCluskey, TC’88]
    - Algorithm-based fault tolerance: use of some property of the computation for self-checking [Huang & Abraham, TC’84], [Dutt & Assad, TC’96]

Past Work – General Fault Detection and Tolerance (contd.)
• Concurrent on-line testing (contd.)
  - Reconfigurable systems: on-line testing and fault tolerance using dynamic circuit reconfiguration
  - FPGA-based systems: on-line testing & FT [Verma, M.S. Thesis, UIC’01], [Dutt et al., ICCAD’99], [Mahapatra & Dutt, FTCS’99], [Abramovici et al., ITC’99]

EM-Induced Faults
• High-level computer failure detection due to different types of EM signals [Mojert et al., EMC’01]
• Radiation therapy machine overdoses patients
• Space Shuttle cannot launch due to a synchronization error in redundant computers
• Failure in real-time communication & control systems from communication line errors due to EM signals [Kohlberg & Carter, EMC’01]
• SEUs (Single-Event Upsets): potential threat to the reliability of integrated circuits operating in radiation environments
  - Space/avionics applications, due to high-energy particles
    - e.g., the Hubble Space Telescope (NASA space-based astronomical observatory)
  - Ground level (atmospheric neutrons)

Fault Types & Fault Injection Methods
Error Types
• Control flow errors: incorrect sequence of instruction execution
  - Causes: address generation errors, memory faults, bus faults
• Data errors
  - Causes: computation errors, memory & bus faults
• Hung processor & crashes
  - Causes: control unit (C.U.) transition to dead-end states, invalid instruction, out-of-bound address, divide-by-zero
• Error types are NOT mutually exclusive

Fault Types & Fault Injection Methods
Fault Injection Methods
• Hardware fault injection
  - With contact (voltage or current changes; uses pin-level probes and sockets): Messaline [Arlat et al., FTC’89]
  - Without contact (heavy-ion radiation and EMI): FIST [Gunneflo et al., FTC’89], MARS [Karlsson et al., DCCA’95]
• Software fault injection
  - Compile-time injection (modifying program instructions): Doctor [Han et al., CPD’95]
  - Runtime injection (trigger fault injection mechanism): time-out, exception/trap, code insertion
  - Tools: Xception [Carreira et al., DCCA’95], Ferrari [Kanawati et al., FTC’92], Ftape [Tsai et al., FTC’96]

Fault Types & Fault Injection Methods
Software Fault Injection (contd.)
• Advantages
  - Does not require expensive hardware
  - Can target applications and operating systems, which is difficult to do with hardware fault injection
• Disadvantages
  - Changes the structure of the original software
  - Cannot inject faults into locations that are inaccessible to software

Fault Types & Fault Injection Methods
[Figure: fault injection system and target system. Blocks: controller, fault library, fault injector, workload library, workload generator, monitor, data collector, data analyzer.]

Characteristics of Fault Injection Methods

                  Hardware                        Software
                  With contact   Without contact  Compilation         Runtime
Cost              High           High             Low                 Low
Damage            High           Low              None                None
Trigger           Yes            No               Yes                 Yes
Repeatability     High           Low              High                High
Controllability   High           Low              High                High
Acc. FIP          Chip pin       Chip internal    Reg., mem., soft.   Reg., mem., I/O cont./port

(Acc. FIP = accessible fault injection points)

Proposed Work
• VHDL modeling of a modern microprocessor (using an available VHDL description of the DLX microprocessor, with appropriate modifications)
• VHDL-based introduction of fault injection logic in the CPU, as well as in memory and external buses, to simulate different fault patterns likely caused by EMI
• Develop techniques for detection of program errors due to these faults
• Classification of the fault types into data, control and hung/crashed processor
• Preliminary results for simulation of faults in external memory address and data buses

Proposed Work
• Location & values of faults
• Fault types (stuck-at-0, stuck-at-1, single random, clustered, multiple random, etc.)
• Duration of faults & start times
  - [0, 50T], where T = CPU clock cycle
  - [0, Texc(workload)], where Texc = execution time without fault
[Figure: fault generator block diagram. Labels: signal line data, Memory, Counter_1, Counter_2, 1/0, variable-width variable-period pulse generator, Data Bus, Address Bus, Fault Generator, DLX CPU.]

Proposed Work (contd.)
• Will include similar fault-injection capability for on-chip wires, with a probabilistic component based on the analysis of EM effects on power/ground (p/g) lines from the circuit analysis component
• Processor will be partitioned into 4 main modules (control unit, ALU, register file & cache), with separate or common p/g lines for these, to determine different degrees of susceptibility
[Figure: Control Unit, Cache, Register File and ALU blocks, each with p/g lines.]

Methodologies: Control Flow Checking
• A watchdog: a small co-processor that monitors the behavior of the system
• Provided beforehand with information about the processor to be checked (memory access, control flow, control signals, ...)
• Compares the information gathered concurrently to the information previously provided
• Complexity lies between the current circuit-level and system-level techniques
[Figure: Memory Hierarchy, Memory Bus, Processor and Watchdog; the watchdog also receives a signal from the branch circuit.]

Methodologies: Control Flow Checking

_fibo:  sw    -4(r14),r30
        ...
        seq   r1,r3,r4
        bnez  r1,L3
        ...
        seq   r1,r3,r4
        bnez  r1,L3
        j     L2
L3:     ...
        addi  r1,r0,#1
        j     L1
L2:     ...
[Figure: program graph with nodes n1 to n5; labels: L1, Sign(n4), BRT L1, WD (watchdog).]
• A node is a block of instructions with a branch at the end
• A derived signature of a node is a function (e.g., XOR, LFSR) of all instructions in the node
• A program graph is one in which there is an arc from node u to node v if the branch at u can lead to node v
• Based on the signature computation, error coverage is high (>90%) even with multiple faults [Mahmood & McCluskey, FTCS’85]

Examples of Error Types

L1:     ...
        lw    r3,0(r30)
        addi  r0,r0,#1
        seq   r1,r3,r0
        bnez  r1,L1
L2:     ...
        subi  r2,r2,#1
        seq   r1,r3,r2
        bnez  r1,L2
        j     L4
L3:     ...
        addi  r1,r0,#1
        j     L1
L4:     ...

Error types
• Segmentation fault: r0=24, r3=25
• Hung processor: r2=1, r3=0
• Out-of-bound address: L4=256
• Invalid instruction: instruction code can be changed

Analysis of Error
[Figure: percentage of error as a function of fault duration. x-axis: fault duration; y-axis: % of error.]
[Figure: percentage of error as a function of fault injection frequency. x-axis: fault injection frequency (%); y-axis: % of error.]

Analysis of Error
• Program never finished (47%)
• Program terminated incorrectly (23%)
• Terminated with incorrect result (23%)
• Terminated with correct result (7%)

Methodologies: Algorithm-Based Fault Tolerance
• Instruction execution errors
  - Difficult to detect: they occur inside the microprocessor and are not observable to an external watchdog processor
• Off-line scheme for detecting execution errors due to permanent faults [K.K. Saluja et al., IEEE ITC 1983]
• Transient faults occur more frequently than permanent faults in digital systems
• Detecting transient faults must be done in real time

Methodologies: Algorithm-Based Fault Tolerance
• Use properties of the computation to check correctness of computed data
• E.g., the linearity property f(v1 + v2) = f(v1) + f(v2) of a computation f() can be used to check it
  - Pre-compute v’ = v1 + v2 + … + vk (input checksum)
  - Compute f(v1), …, f(vk)
  - Compute u = f(v1) + f(v2) + … + f(vk) (output checksum)
  - Check if f(v’) = u; inequality indicates computation error(s)
• Can be used for linear computations such as matrix multiplication, matrix addition, Gaussian elimination [Huang & Abraham, TC’84], [Dutt & Assad, TC’96]

Methodologies: Algorithm-Based Fault Tolerance
• Use a watchdog to monitor the bus and fetch the instruction opcodes along with the main processor
• Calculate the expected execution parameters of each instruction
• Store this information in the watchdog processor (instruction parameter table)
• Compare the fetched instruction parameters with the stored data
• If the parameters do not match, give an error message
• Error coverage depends on the program and the microprocessor; for the 8086 instruction set, error coverage is around 85% for single-bit errors [Khan & Tront, IEEE TC, 1989]

Goals, Questions & Future Outlook
• Q: Are there patterns of errors that lead to computer crashes with high probability?
• Q: If so, can the detection of such patterns be used to shut down the computer in a fail-safe manner (save state & data for later resumption)?
• Q: Are there patterns of errors that are characteristic of EM-induced faults versus random single/double faults?
• Q: If so, can these be used as “early detection & warning” of EM interference?
• Future: Based on the correlation of system errors to EM faults, determine fault tolerance/error minimization techniques for EM-induced faults.
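
Illustration for the Proposed Work slides: the fault types listed there (stuck-at-0, stuck-at-1, single random, clustered, multiple random) and the fault start time and duration can be mimicked in software on a single bus word. The sketch below is Python, not the VHDL fault generator of the proposed system; all function names, the 8-bit bus width and the cycle numbers are assumptions made for illustration only.

```python
import random
from typing import Callable, Iterable, List


def stuck_at(word: int, bit: int, value: int) -> int:
    """Force one bus line to 0 or 1 (stuck-at-0 / stuck-at-1)."""
    return word | (1 << bit) if value else word & ~(1 << bit)


def flip_bits(word: int, bits: Iterable[int]) -> int:
    """Invert the given bus lines (single, clustered or multiple faults)."""
    for bit in bits:
        word ^= 1 << bit
    return word


def clustered_bits(start: int, count: int, width: int) -> List[int]:
    """Adjacent line positions, clipped to the bus width."""
    return [b for b in range(start, start + count) if b < width]


def random_bits(count: int, width: int, rng: random.Random) -> List[int]:
    """Distinct randomly chosen line positions."""
    return rng.sample(range(width), count)


def observe_bus(word: int, cycle: int, start: int, duration: int,
                fault: Callable[[int], int]) -> int:
    """Apply the fault only while start <= cycle < start + duration."""
    active = start <= cycle < start + duration
    return fault(word) if active else word


# Hypothetical 8-bit bus value and fault parameters (in clock cycles).
data = 0b10101010
sa0 = stuck_at(data, bit=1, value=0)
sa1 = stuck_at(data, bit=0, value=1)
clustered = flip_bits(0xFF, clustered_bits(start=2, count=3, width=8))
multiple = flip_bits(data, random_bits(3, 8, random.Random(0)))
before = observe_bus(data, cycle=5, start=10, duration=20,
                     fault=lambda w: stuck_at(w, 1, 0))
during = observe_bus(data, cycle=12, start=10, duration=20,
                     fault=lambda w: stuck_at(w, 1, 0))
```

Passing the bit positions and the random generator explicitly keeps each injected fault repeatable, which matches the repeatability and controllability criteria used to compare injection methods.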
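
Illustration for the Control Flow Checking slides: a node's derived signature and the program graph can be checked by a watchdog roughly as follows. This is a minimal Python sketch, assuming an XOR signature (one of the two example functions named on the slide); the `Watchdog` class, the node names and the instruction words are hypothetical and do not come from the DLX model.

```python
from functools import reduce
from typing import Dict, List, Optional, Set, Tuple


def xor_signature(instructions: List[int]) -> int:
    """Derived signature of a node: XOR of all its instruction words."""
    return reduce(lambda acc, word: acc ^ word, instructions, 0)


class Watchdog:
    """Holds reference signatures and the program graph, checks a run."""

    def __init__(self, nodes: Dict[str, List[int]],
                 arcs: Set[Tuple[str, str]]) -> None:
        # Reference information provided to the watchdog beforehand.
        self.reference = {n: xor_signature(code) for n, code in nodes.items()}
        self.arcs = arcs
        self.previous: Optional[str] = None

    def observe(self, node: str, fetched: List[int]) -> List[str]:
        """Check one executed node; return a list of detected errors."""
        errors = []
        # Control flow check: the taken branch must be an arc of the graph.
        if self.previous is not None and (self.previous, node) not in self.arcs:
            errors.append(f"illegal branch {self.previous}->{node}")
        # Signature check: fetched instructions must match the reference.
        if xor_signature(fetched) != self.reference[node]:
            errors.append(f"signature mismatch in {node}")
        self.previous = node
        return errors


# Hypothetical instruction words for three nodes and their legal arcs.
nodes = {"n1": [0x11, 0x22, 0x33], "n2": [0x44, 0x55], "n3": [0x66, 0x77, 0x88]}
arcs = {("n1", "n2"), ("n1", "n3"), ("n2", "n3")}

wd = Watchdog(nodes, arcs)
clean = wd.observe("n1", nodes["n1"]) + wd.observe("n2", nodes["n2"])
corrupted = wd.observe("n3", [0x66, 0x77 ^ 0x01, 0x88])  # one flipped bit

wd2 = Watchdog(nodes, arcs)
wd2.observe("n2", nodes["n2"])
illegal = wd2.observe("n1", nodes["n1"])  # n2 -> n1 is not in the graph
```

A plain XOR signature cannot see an even number of flips in the same bit position within one node, which is one reason an LFSR-based signature may be preferred.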
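
Illustration for the Algorithm-Based Fault Tolerance slides: the input/output checksum test for a linear computation can be written out in a few lines. The Python sketch below assumes f is an integer matrix-vector product so that the comparison f(v') = u is exact; the matrix `A`, the input vectors and the injected bit flip are hypothetical values chosen for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[int]


def mat_vec(matrix: Sequence[Sequence[int]], v: Sequence[int]) -> Vector:
    """Linear computation f(v) = A * v in exact integer arithmetic."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]


def vec_sum(vectors: Sequence[Sequence[int]]) -> Vector:
    """Element-wise sum of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def abft_check(f: Callable[[Sequence[int]], Vector],
               inputs: Sequence[Sequence[int]],
               outputs: Sequence[Sequence[int]]) -> bool:
    """Return True if f(input checksum) equals the output checksum."""
    v_prime = vec_sum(inputs)   # input checksum  v' = v1 + ... + vk
    u = vec_sum(outputs)        # output checksum u  = f(v1) + ... + f(vk)
    return f(v_prime) == u


# Hypothetical linear computation and inputs.
A = [[2, 0, 1], [1, 3, 0], [0, 1, 4]]


def f(v: Sequence[int]) -> Vector:
    return mat_vec(A, v)


inputs = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
outputs = [f(v) for v in inputs]
clean_ok = abft_check(f, inputs, outputs)

# Hypothetical transient fault: one bit flipped in one computed output.
faulty = [list(o) for o in outputs]
faulty[1][0] ^= 0b100
fault_detected = not abft_check(f, inputs, faulty)
```

With floating-point data the equality test would need a tolerance, since rounding alone can make f(v') and u differ slightly.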