Why Do Airplanes Crash? An Open Source Air Data Inertial Reference Unit Investigation
***
2012 PSU/Galois Capstone Project
Chris Andrews, Trang Nguyen, Mark Craig, Kayla Seliner

Presentation

Air Data Inertial Reference Unit
• Our Project: building an open source ADIRU
• Overview: what is an ADIRU?
• Motivation: why are they important?
• Fault Tolerance: types of faults
• Approach: voting methods
• Design: hardware and software architecture
• Results
• Conclusion

Project Goals
• Construct a small, low-power ADIRU system to deploy on an RC aircraft.
• Implement a Byzantine fault tolerant algorithm on a system of multiple microprocessors (voters) and sensors.
• Use input from multiple sensors, including gyroscopes, accelerometers, GPS, and airspeed.
• Use open source hardware and software when possible.

Air Data Inertial Reference Units are an essential component in modern avionics systems
• The ADIRU system collects and processes sensor values from accelerometers, gyroscopes, altimeters, and airspeed indicators and functions as the single source of sensor data aboard the aircraft.
• Many commercial aircraft, including the Airbus A330 and Boeing 777, use ADIRUs as part of their avionics suite.
• The Air Data Inertial Reference Unit may itself be triple redundant.
• The ADIRU system replaces earlier fault tolerant triple modular redundant (TMR) systems.
• Autopilot and unstable flight regimes depend upon valid and uninterrupted sensor data for safe flight.

[Figure: TMR vs. ADIRU]

Benefits of ADIRU Systems
• Redundancy: redundant sensors make the system less vulnerable to a single sensor failure.
• Modularization and fault containment: the ADIRU system is the single source of sensor data for all the cockpit instruments and avionics software on the aircraft.
• Deferred maintenance: a sufficient margin of safety may be preserved in some systems to operate with a small number of faulty components and avoid expensive emergency repairs.

ADIRU Vulnerabilities
• System complexity
• Closed source, proprietary system

Triple Modular Redundant System
• Votes on the outputs of three redundant sensors.
• System can tolerate a single sensor fault.
• Relatively simple to implement and diagnose.

Byzantine Fault Tolerant System
• At least 4 different voters, each with a sensor.
• Tolerates a fault in a sensor or in a voter.
• F faults require 3F+1 voters with sensors.
• Requires a complex voting algorithm.
• Can survive a class of faults not dealt with by TMR.

ADIRU failures are critical events, with serious consequences if the aircraft is not in a visual flight mode.
Image retrieved May 24, 2012 from https://encryptedtbn2.google.com/images?q=tbn:ANd9GcSYkmoQuv2Uml0vQjTVrW6z0zXHMhr6MdlZkyQJhHD5D5h_2vwZA [1]

Air France Flight 447
Air France Flight 447, which departed Rio de Janeiro for Paris on May 31st, 2009, crashed into the Atlantic Ocean, killing all passengers and crew.
Image retrieved June 1, 2012 from: cdn.blogs.sheknows.com/ thewire.sheknows.com/2011/05/airfrance447.jpg

Sequence of Events Leading To Crash
• Corrupted Sensor Data: pitot tubes blocked with ice transmit Byzantine faults to the ADIRU.
• Loss of Control: Autopilot disengages. The flight crew receive erratic and inconsistent airspeed data and stall the aircraft.
• No Recovery: The flight crew fails to recover from the stall because the crew cannot determine actual airspeed. Flight computer does not restart. Aircraft free falls into the Atlantic.

Qantas Flight 72
On October 7th, 2008, Qantas Flight 72, en route from Singapore to Perth, suffered a malfunction in the ADIRU and flight computer, causing a series of rapid descents that threw passengers and crew about the cabin.
Sequence of Events Leading To Incident
• ADIRU failure: A "spiky" series of measurements from the angle of attack sensors, which measure aircraft pitch in relation to airflow, exploited a vulnerability in the ADIRU software. Bad data is output to the flight controller from one of the ADIRU units.
• Flight Computer software failure: The flight computer under autopilot fails to filter the bad data and executes an abrupt dive of -0.8 g. The flight crew disengages autopilot and makes an emergency landing at Learmonth, Western Australia.

How ADIRU Systems Fail
• Failure of the ADIRU may be intermittent and cause cockpit instrumentation to send contradictory warnings (stall and high speed).
• The ADIRU is the root of all sensor data for flight avionics. A failure in the ADIRU can instantly propagate throughout the flight control system.
• Failures of the ADIRU system affect both autopilot and manual flight modes.

Multiple Sources of Failure
• Human Causes: Deferred maintenance can cause errors to accumulate until the ADIRU system fails.
• Environmental Causes: ADIRU systems interface with physical sensors outside the cabin that can be affected by ice and environmental conditions.
• Software: Software may hide bugs that appear under anomalous conditions.
Most accidents have multiple causes.

Types of Faults
• Fail Silent: the system fails to send data. This fault is masked by a redundant system.
• Byzantine Failure: the system sends arbitrary data, including different data to different controllers. This fault cannot be masked by simple redundancy.

Project Requirements
• Exhibit Byzantine and fail silent fault tolerance
• Include fault injection
• Must be able to mask faults
• System must be expandable
• Must follow open source guidelines

Build a four-way redundant network using Arduino microcontrollers polling gyroscopes and accelerometers. Network the modules with an I2C bus.
Features:
• I2C and power bus
• Environmental enclosure
• Separate board for power supply

Reasons For Choosing Arduino
• Open source hardware and software
• Large community of developers
• Libraries for I2C communication already exist
• Lowest hardware entry cost to develop a multi-module fault tolerant system
• Quickest start time (no hardware development necessary)

Arduino ArduIMU+V3
Features:
• ATmega328 microprocessor
• 3D accelerometer and 3D gyroscope
• 3D magnetometer

Software Algorithm
• Clock synchronization
• Multi-master I2C bus
• Byzantine algorithm
• Fault injection

Safety critical systems should be able to handle failures of one or more of their components and continue to operate correctly.
Byzantine faults consist of one or more components or subsystems sending inconsistent data to other components and subsystems.
Handling these types of failures is known as the Byzantine Generals Problem.

A solution to the Byzantine Generals Problem guarantees fault tolerant behavior under the following conditions.
• All loyal generals decide upon the same plan of action.
• A small number of traitors cannot cause the loyal generals to adopt a bad plan.
• More than 2/3 of the generals must be loyal.
• Must have 3*N + 1 generals to handle N traitors.

• The general sends a command to N-1 lieutenants (here N is the total number of generals).
• All loyal lieutenants obey the same command.
• If the general is loyal, then every loyal lieutenant obeys the command he sends.
• Each lieutenant communicates the command it received from the general to every other lieutenant.
• Each lieutenant reaches a decision based on a majority vote of the commands received from the general and all other lieutenants.
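The exchange described above can be sketched in a few lines of Python. This is a minimal illustration of one round of the oral-message scheme with one general and three lieutenants, not the project's Arduino code. The names `majority`, `om1`, and `relay`, the module numbering, and the choice of [0, 0, 0] as the fallback when no majority exists are all assumptions made for the example.

```python
from collections import Counter

def majority(values, default=(0, 0, 0)):
    """Return the value held by a strict majority, else a default (assumed)."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else default

def om1(general_sends, relay):
    """One round of message exchange among the lieutenants.

    general_sends[i] is the vector the general sends to lieutenant i.
    relay(i, j, v) is what lieutenant i tells lieutenant j it received;
    a loyal lieutenant returns v unchanged.
    """
    lieutenants = sorted(general_sends)
    decisions = {}
    for j in lieutenants:
        # Own copy from the general plus what every other lieutenant relays.
        received = [general_sends[j]]
        received += [relay(i, j, general_sends[i]) for i in lieutenants if i != j]
        decisions[j] = majority(received)
    return decisions

def loyal(i, j, v):
    """A loyal lieutenant passes on exactly what it received."""
    return v

def traitor_4(i, j, v):
    """Hypothetical fault: module 4 relays [0, 0, 0] instead of the truth."""
    return (0, 0, 0) if i == 4 else v

# Case 1: loyal general (module 1), faulty lieutenant (module 4).
case1 = om1({2: (1, 1, 1), 3: (1, 1, 1), 4: (1, 1, 1)}, traitor_4)

# Case 2: faulty general sends a different vector to module 3.
case2 = om1({2: (1, 1, 1), 3: (0, 0, 0), 4: (1, 1, 1)}, loyal)

print(case1)
print(case2)
```

In the first case the loyal lieutenants 2 and 3 still agree on the general's value; in the second, all three lieutenants see the same set of values and therefore reach the same decision even though the general was inconsistent. With four modules this tolerates one fault, matching the 3F+1 bound.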
[Figure: message exchange tree among Modules 1 through 4]

[Figure: message exchange with [X, Y, Z] = [1, 1, 1] and no faults; every received vector is [1, 1, 1]]

[Figure: message exchange with [X, Y, Z] = [1, 1, 1] and a link error between Module 1 and Module 4; some received vectors are [0, 0, 0]]

[Figure: message exchange with [X, Y, Z] = [1, 1, 1] and link errors between Module 1 and Module 4 as well as Module 1 and Module 2; some received vectors are [0, 0, 0]]

Sensor reads are interrupt driven.
The clocks of all modules must be synchronized to ensure an "apples-to-apples" comparison of sensor values.
The variable used to synchronize all modules is the Timer/Compare interrupt counter.
By ensuring the counter is the same on all modules, we can ensure that the interrupt that drives sensor reads occurs at the same time in all modules.

One module is dedicated to synchronizing the clocks of all other modules.
Accuracy of clock synchronization is determined by the timer interrupt clock speed and is approximately:
Timing of clock synchronization cycles is set so that each device is synchronized to the master every few data cycles. This helps to ensure tight synchronization and lessens interference with data processing.

[Figure: clock synchronization message sequence between master and slave, with timestamps T1 through T6. Message labels: Request Slave Clock Value; Send Clock Value; Send Back Slave Clock Value to Calculate Delay; Send Clock Value]
Calculate Delay: Delay = (T4 - T2)/2
Calculate Offset: Offset = T6 - Master Clock Value - Delay
New Clock Value = Old Clock Value - Offset

The output displays the original clock value, the clock value from the master, the offset, and the new clock value.
The offset is "0" because a delay of "1" was calculated.

Results
• System exhibits Byzantine fault tolerance. A system that is BFT requires 3F+1 voters.
• System masks fail silent faults.

[Slide: Budget]

Conclusion
Our system exhibits basic fault tolerant functionality. It demonstrates the feasibility of an open source fault tolerant project.

Further Work
• Integrate GPS, magnetometer, altimeter, and other sensors into the system.
• Implement Kalman filters in the software to smooth out sensor noise.
• Gather real data by launching aboard a vehicle.

Lessons Learned
• Interrupt routines on microcontrollers
• Debugging methods
• Code development: algorithm > Python > C
• How to organize a large project involving hardware and software
• Documentation

Acknowledgements
We would like to thank our sponsors: Dr. Lee Pike and Galois Inc.
We also acknowledge the help of our advisor: Dr. Christof Teuscher, Portland State University

References