Copilot: A Hard Real-Time Runtime Monitor

Lee Pike | Galois, Inc. | leepike@galois.com
Joint work with
Alwyn Goodloe | National Institute of Aerospace
Robin Morisset | École Normale Supérieure
Sebastian Niller | Technische Universität Ilmenau

Need

How do you know your embedded SW won’t fail?
Certification (e.g., DO-178B) is largely process-oriented
Probably not formally verified
• Even if so, may make bad assumptions
• E.g., faults, hardware behavior, program semantics, timing, etc.
Unanticipated faults lead to unexpected behavior
Unanticipated behavior can put life at risk
Need to detect/respond at runtime

© 2010 Galois, Inc.

Just the FaCTS, Ma’am: The Constraints

A runtime monitoring solution must satisfy the FaCTS:
• Functionality: don’t change the target’s behavior
• Certifiability: don’t require re-certification (e.g., DO-178B) of the target
• Timing: don’t interfere with the target’s timing
• SWaP: don’t exhaust size, weight, and power reserves

State of Affairs of Embedded SW Runtime Verification

Most runtime monitoring approaches violate at least one of the FaCTS constraints:
• Inlining monitors changes timing properties and makes a new program (re-certification?)
• Few constant-time/space monitor synthesis tools
Embedded C has not received the same attention as Java

Outline

1. A solution: the Copilot language and compiler
2. Pilot study¹: injecting software faults in a fault-tolerant airspeed system
3. Monitor correctness
4. Conclusions

¹ Pun intended.

Copilot Design (1/2)

Target: hard real-time embedded systems
• Common for guidance, navigation, and control systems
Monitoring by sampling data at a fixed time period
• Could be C variables/arrays, memory-mapped registers, etc.
• Adequate for hard real-time target programs, which are driven primarily by time, not events
• Doesn’t require inlining monitors

Copilot Design (2/2)

Stream / data-flow specification language
• Think: Haskell infinite lists
Generates embedded ANSI C99 programs
• Can run “bare” on minimal microprocessors
Does not modify or instrument target software
• Generates its own schedule; no RTOS needed
  – Distributes monitor processing across the schedule
• But could run as one high-priority RTOS task
Constant-time & constant-space code
• Very small data footprints
• Easier WCET estimation
Side effects impossible
• A tenet of dataflow languages

Copilot Implementation: the eDSL Approach

A Haskell embedded domain-specific language (eDSL), e.g., a library
Types, parser, lexer, macro language... for free
• ...and probably correct
Atom eDSL is the backend (http://hackage.haskell.org/package/atom)
• Originally developed by Tom Hawkins (huge thanks to Tom!)
• Very useful itself for writing more procedural real-time C code
“Compile” and “interpret” are functions
• Interpreter (almost) for free
[Diagram labels: Haskell (macro language), Copilot (front-end), Atom, Interpreter]

Stream Semantics (Append)

In Copilot (all operators are lifted):

    x .= [0, 1, 2] ++ (varW64 x + 3)

In Haskell:

    f [0, 1, 2]
      where f :: [Word64] -> [Word64]
            f x = x ++ f (map (+3) x)

x = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ...
x = [0, 1, 2] --(+3)--> [3, 4, 5] --(+3)--> [6, 7, 8] ...

Stream Semantics (Drop)

    x .= [0, 1, 2] ++ (var x + 3)
    y .= drop 2 (var x)

x = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ...
y = 2, 3, 4, 5, 6, 7, 8, 9, 10 ...
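
To make the append and drop semantics above concrete, here is a minimal C99 sketch of how the streams x and y could be evaluated in constant space with a three-element ring buffer. It is an illustration under stated assumptions, not the code Copilot generates: the names `stream_x`, `stream_init`, `stream_x_now`, `stream_y_now`, and `stream_step` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of the initial list [0, 1, 2] in  x .= [0, 1, 2] ++ (var x + 3). */
#define HISTORY 3u

/* Hypothetical state for stream x: the buffer always holds x[n], x[n+1], x[n+2]. */
typedef struct {
    uint64_t buf[HISTORY];
    uint64_t idx; /* slot that currently holds x[n] */
} stream_x;

/* Load the initial values 0, 1, 2. */
void stream_init(stream_x *s) {
    assert(s != NULL);
    s->buf[0] = 0u;
    s->buf[1] = 1u;
    s->buf[2] = 2u;
    s->idx = 0u;
}

/* Current value of x. */
uint64_t stream_x_now(const stream_x *s) {
    return s->buf[s->idx];
}

/* Current value of y .= drop 2 (var x), i.e. x two steps ahead. */
uint64_t stream_y_now(const stream_x *s) {
    return s->buf[(s->idx + 2u) % HISTORY];
}

/* Advance one period: x[n+3] = x[n] + 3 overwrites the slot of x[n]. */
void stream_step(stream_x *s) {
    s->buf[s->idx] += 3u;
    s->idx = (s->idx + 1u) % HISTORY;
}
```

Calling `stream_x_now`, `stream_y_now`, and then `stream_step` once per period yields x = 0, 1, 2, 3, ... and y = 2, 3, 4, 5, ...; the buffer length equals the length of the list on the left of `++`.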

Timed Semantics

Period: duration between discrete events
Phase: offsets into the period
[Figure residue: x1(); x2();]
Example:
• x: period 4, phase 1
• y: period 4, phase 3
Copilot ensures synchronization between streams
• Assuming synchronization of phases in distributed systems: no non-faulty processor reaches the start of phase p+1 until every non-faulty processor has started phase p

Example Copilot Specification

“If the temperature rises more than 2.3 degrees within 2 seconds, then the engine has been shut off.” (period == 1 sec)

    engine :: Streams
    engine = do
      temps        .= [0, 0, 0] ++ extF temp 1
      overTempRise .= drop 2 (varF temps) > 2.3 + var temps
      trigger      .= var overTempRise ==> extB shutoff 2

Annotations:
• [0, 0, 0]: initial “don’t care” values
• extF temp 1, extB shutoff 2: external variables; the trailing number is the phase to sample in
• Variables are strings, e.g.

    x :: String
    x = "x"

Sample Code Generated (Incomplete)

    engine :: Streams
    engine = do
      ...
      trigger .= (var overTempRise) ==> (extB shutoff 2)

External variable sample function:

    /* engine.sample__shutoff_2 */
    static void __r6() {
      bool __0 = true;
      bool __1 = shutoff;
      if (__0) { }
      engine.tmpSampleVal__shutoff_2 = __1;
    }

State-update function for the trigger stream:

    /* engine.updateOutput__trigger */
    static void __r0() {
      bool __0 = true;
      bool __1 = engine.tmpSampleVal__shutoff_2;
      bool __2 = ! __1;
      float __3 = 2.3F;
      uint64_t __4 = 0ULL;
      uint64_t __5 = engine.outputIndex__temps;
      uint64_t __6 = __4 + __5;
      uint64_t __7 = 4ULL;
      uint64_t __8 = __6 % __7;
      float __9 = engine.prophVal__temps[__8];
      float __10 = __3 + __9;
      uint64_t __11 = 2ULL;
      uint64_t __12 = __11 + __5;
      uint64_t __13 = __12 % __7;
      float __14 = engine.prophVal__temps[__13];
      bool __15 = __10 < __14;
      bool __16 = __2 && __15;
      bool __17 = ! __16;
      if (__0) { }
      engine.outputVal__trigger = __17;
    }

Types

Types: Int & Word (8, 32, 64), Float, Double
Each stream has a unique type:

    x .= ([0, 1] ++ (varW64 x + 3) :: Spec Word64)

Type inference to minimize “type hints”:

    x .= extI32 a 1 > extI32 b 2
    y .= [0, 1] ++ (varW64 y + 3 + var y)

Casting
• Implicit casting is a type error:

    x .= varW64 y + varW32 z

• Explicit casting guarantees:
  – Signs are never lost (no Int --> Word casts)
  – No overflow (no cast to a smaller width)

    x .= [True] ++ not (var x)
    y .= castI32 (varB x) + castI32 (castW8 (varB x))

“Sir, it’s eDSLs all the way down!” (1/2)

Don’t like the stream language interface? Then define another DSL on top.
We’ve done just that for
• LTL
• Past-time LTL
• Statistical properties (in progress)
Think of these either as new DSLs or as Copilot libraries.
Because they’re all in the same host language, they can be intermingled.

“Sir, it’s eDSLs all the way down!” (2/2)

Example: “If the engine temperature exceeds 250 degrees, then the engine is shut off within one second, and in the 0.1 second following the shutoff, the cooler is engaged and remains engaged.” (period == 0.1 sec)

    engine :: Streams
    engine = do
      temp `ptltl` alwaysBeen (extW8 engineTemp 1 > 250)
      cnt .= [0] ++ mux (varB temp && varW8 cnt < 10)
                        (varW8 cnt + 1)
                        (varW8 cnt)
      off .= (varW8 cnt >= 10 ==> extB engineOff 1)
      cooler `ptltl` (extB coolerOn 1 `since` extB engineOff 1)
      monitor .= varB off && varB cooler

Copilot Language Restrictions

Design goal: make memory usage constant and “obvious” to the programmer
No anonymous streams
No lazily-computed values
• E.g., the following is rejected, because y would need values of x that have not been buffered yet:

    x .= [0] ++ (varW16 x) + 1
    y .= drop 2 (varW16 x)

Other restrictions (see paper)
Upshot: “WYSIWYG memory usage”
• Memory constrained by number of streams
• Memory for each stream is essentially the LHS of ++
• Doesn’t include stack variables

Timing Info & Expression Counts

Timing info and expression count (the expression count helps with WCET analysis):

    Period  Phase  Exprs  Rule
    ------  -----  -----  ----
    3       0      18     engine.updateOutput__trigger
    3       0      14     engine.updateOutput__overTempRise
    3       0      3      engine.update__temps
    3       1      7      engine.output__temps
    3       1      2      engine.sample__temp_1
    3       2      6      engine.incrUpdateIndex__temps
    3       2      2      engine.sample__shutoff_2
                   -----
                   52

Hierarchical Expression Count

    Total  Local  Rule
    -----  -----  ----
    52     0      engine
    6      6        incrUpdateIndex__temps
    7      7        output__temps
    2      2        sample__shutoff_2
    2      2        sample__temp_1
    14     14       updateOutput__overTempRise
    18     18       updateOutput__trigger
    3      3        update__temps

    Generated engine.c and engine.h
    Moving engine.c and engine.h to ./
    Calling the C compiler ...
    gcc ./engine.c -o ./engine -Wall

Interpreter / Semantics (in one slide)

    interpret :: Streamable a => Vars -> Vars -> Spec a -> [a]
    interpret inVs moVs s =
      case s of
        Const c         -> repeat c
        Var v           -> getElem v inVs
        PVar _ v _      -> getElem v moVs
        PArr _ (v,s') _ -> map (\i -> getElem v moVs !! fromIntegral i)
                               (interpret inVs moVs s')
        Append ls s'    -> ls ++ interpret inVs moVs s'
        Drop i s'       -> drop i $ interpret inVs moVs s'
        F f _ s'        -> map f $ interpret inVs moVs s'
        F2 f _ s0 s1    -> zipWith f (interpret inVs moVs s0)
                                     (interpret inVs moVs s1)
        F3 f _ s0 s1 s2 -> zipWith3 f (interpret inVs moVs s0)
                                      (interpret inVs moVs s1)
                                      (interpret inVs moVs s2)

Download, Develop, Use

http://leepike.github.com/Copilot/
BSD3-licensed
Just a cabal install away

Usage

    compile spec "c-name" [opts] baseOpts
    interpret spec rounds [opts] baseOpts
    test rounds [opts] baseOpts
• quickChecking the compiler/interpreter
    verify filepath int
• SAT solving on the generated C program
    help (commands and options)
    [spec] (parser)

Opts (incomplete list):
• C trigger functions
• Ad-hoc C code (library included for writing this)
• Hardware clock
• Verbosity
• GCC options

Interlude: Pitot Failures

Failures cited in:
Northwest Orient Airlines Flight 6231 (1974), 3 killed
• Increased climb/speed until uncontrollable stall
Birgenair Flight 301, Boeing 757 (1996), 189 killed
• One of three pitot tubes blocked; faulty airspeed indicator
Aeroperú Flight 603, Boeing 757 (1996), 70 killed
• Tape left on the static port(!) gave erratic data
Austral Líneas Aéreas Flight 2553, Douglas DC-9 (1997), 74 killed
• Freezing caused spurious low reading, compounded with a failed alarm system
• Speed increased beyond the plane’s capabilities
Air France Flight 447, Airbus A330 (2009), 228 killed
• Airspeed “unclear” to pilots
• Still under investigation
Not an exhaustive list!

Test Bed

Representative of fault-tolerant systems
4 x STM microcontrollers, ARM Cortex M3 cores clocked at 72 MHz
MPXV5004DP differential pressure sensor
• Senses dynamic and static pitot tube pressure
• Pitot tubes measure airspeed
Designed to fit a UAS (unpiloted air system)
• Size, power, weight, ...

Test Bed Architecture
[Figure not recoverable]

Aircraft Configuration (1/2)
Edge 540T-R1
[Figure not recoverable]

Aircraft Configuration (2/2)
[Figure not recoverable]

Copilot Monitors

Introduced software faults to be caught by Copilot monitors:
Abrupt airspeed change: airspeed > 100 m/s
Majority assumption:
• Used the Boyer-Moore majority vote algorithm
• Finds the majority in exactly one pass...
• ...if a majority exists; it returns an arbitrary value otherwise
• Monitor checks whether the returned value is a majority
Voting agreement:
• Check agreement between the voted values
• Uses coordinating distributed monitors
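
The majority-assumption monitor above can be sketched in C99 as follows. This is a minimal illustration, not the test bed's code: the function names `majority_candidate` and `is_majority` and the choice of `uint32_t` readings are assumptions. The first function is the one-pass Boyer-Moore vote; the second is the kind of check a monitor would need, since the vote returns an arbitrary value when no majority exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One-pass Boyer-Moore majority vote (hypothetical helper).
 * Returns the majority value if one exists; otherwise the result is
 * an arbitrary element (or 0 for an empty input). */
uint32_t majority_candidate(const uint32_t *v, size_t n) {
    assert(v != NULL || n == 0);
    uint32_t candidate = 0u;
    size_t count = 0u;
    for (size_t i = 0; i < n; i++) {
        if (count == 0u) {          /* no current candidate: adopt v[i] */
            candidate = v[i];
            count = 1u;
        } else if (v[i] == candidate) {
            count++;                /* vote for the candidate */
        } else {
            count--;                /* vote against the candidate */
        }
    }
    return candidate;
}

/* Monitor-side check: does the candidate occur in more than half of
 * the n readings? */
bool is_majority(const uint32_t *v, size_t n, uint32_t candidate) {
    size_t hits = 0u;
    for (size_t i = 0; i < n; i++) {
        if (v[i] == candidate) {
            hits++;
        }
    }
    return 2u * hits > n;
}
```

For the readings {5, 5, 7, 5} the vote returns 5 and the check succeeds; for {1, 2, 3, 4} the vote returns some element but the check fails, which is the condition the monitor would flag.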

Now Execute the Test Suite
[Video not recoverable]

Monitoring Results

Monitoring approach worked and did not disrupt the FaCTS properties of the observed system
• Under 100 C expressions
• No more than 16 expressions per phase
• Binaries on the order of 10k
Monitoring via sampling works for periodic tasks
Off-by-one (phase) in one monitor, so spurious data
• Don’t write monitors at 2am before the flight test :)

Correctness (1/2)

Who watches the watchmen?
-Wall (Haskell)
-Wall (gcc)
• Of course, you can use lint, valgrind, etc. if you like
Copilot is typed (for free!) by Haskell:
• Statically typed: type-checking at compile time
• Strongly typed: no implicit type conversions
  – Makes hacks in the compiler hard, which is good
“QuickCheck” equivalence checker between the Copilot interpreter and generated C code
• Caught a subtle dependency bug

Correctness (2/2)

Using CBMC “out of the box”
Bounded Model Checker for ANSI-C
• Google: cbmc ansi-c
Provide a depth to unroll the Copilot program
Proves absence of
• buffer overflows (both lower and upper)
• dereferences of NULL pointers
• division by zero
• floating-point computations resulting in not-a-number (NaN)
• use of uninitialized local variables

“Why not just use Esterel?”

✗ Not certified
Open-source (BSD3), free
• A good platform for open research & education
• Extensible (e.g., LTL libraries)
Focused on monitoring, not control
Correctness:
• Copilot: ~3k LOCs, Atom: ~2k LOCs
• Working on full proof of correspondence
Need to confirm:
• Smaller memory footprint
• More control over memory usage, scheduling, etc.

Future Work

Equivalence checking: interpreter <--> C code
• Using SAT-solving on AIG structures, à la Cryptol (e.g., Erkök, Carlsson, Wick. Hardware/software co-verification of cryptographic algorithms using Cryptol, FMCAD, 2009)
The steering problem
• Mode change
Fault-tolerant monitor generation
• Monitors need 10^-9 failures/hour reliability, too!
Programmer assistance with scheduling
Sampling rate vs. data history
MC/DC coverage (should be nearly trivial)

Conclusions

Problem space: hard real-time embedded C
• Just the FaCTS, ma’am:
  – Functionality, certifiability, timing, SWaP
• Monitoring by sampling
Goal: build a language/compiler that
• Is expressive enough, e.g., for ptLTL properties
• But enforces austere memory usage requirements
Use an eDSL approach
• Improve the probability of correctness
• Reduce the overhead of a new compiler

Shameless plug: We’re looking for a 2010 summer intern: email <leepike@galois.com>

Acknowledgements: Thanks to LaRC’s Safety Critical Aeronautics Systems Branch/D320 for allowing us to fly with them. This work is supported by Contract NNL08AD13T and monitored by Dr. Ben Di Vito.

Appendix

Monitoring By Sampling

Without inlining monitors, we must sample:
Property: (011)*
False positive (monitor misses a fault):
• Values are 0111011 but sampling sees 011011
False negative (monitor signals a fault that didn’t occur):
• Values are 011011 but sampling sees 0111011
Observation: with a fixed periodic schedule and shared clock
• False negatives are impossible
• We don’t want to re-steer an unbroken system
• False positives are possible, but require constrained misbehavior

Pitot Data
[Figure not recoverable]

Distributed Monitors
[Figure not recoverable]