Safety Critical Software Development - Suparna Paruthy

Safety Requirements
• Customer interaction
• Similar product in same intended market
• Competitive intelligence
• Professional assistance

Certification Killers
• Unclear requirements
• Lack of clear evidence of compliance
• Not doing research up front
• Lack of dedicated resources
• Trying to safety certify too many things
• Not accounting for enough resources to document the safety case
• Not using a single contact to interface with the assessor
• Not being honest about the weaknesses of the proposed system

Project Planning Strategies
1. Determine the project certification scope early
   Identify which standards your product needs to meet.
2. Determine feasibility of certification
   Answer up front whether the product and solution are technically and commercially feasible.
3. Select an independent assessor
   Find an assessor who has experience with your market segment.
4. Understand your assessor’s role
   The assessor’s job is to assess your product with respect to compliance with standards and norms.
5. Assessment communication is key
   Keep a clear line of communication between your team and the group controlling the standards.
6. Establish a basis of certification
   List all of the standards and directives that your product needs to comply with.
7. Establish a “fit and purpose” for your product
   Establishing a fit and purpose up front will prevent future headaches!
8. Establish a certification block diagram
   Generate a hardware block diagram of the system.
9. Establish communication integrity objectives
   Determine the “residual error” rate objectives for each digital communication path.
10. Identify all interfaces along the certification boundary
   Generate a boundary “Interface Control Document”.
11. Identify the key safety defensive strategies
   Identify the safety defensive strategies used to achieve the safety objectives for the program.
12. Define built in test (BIT) capability
   Identify the planned BIT coverage, including initialization, periodic, conditional, and user-initiated tests.
13. Define fault annunciation coverage
   Keeping the system and user interface in mind, define which faults get annunciated.
14. Define reliance and expectation of the operator/user
   Clearly define any reliance that is placed on the operator or user to keep the system safe.
15. Define plan for developing software to appropriate integrity level
   Address the compliance with each element of the applicable standard you are certifying to.
16. Define artifacts to be used as evidence of compliance
   List all of the documents and artifacts you plan to produce as part of the system safety case.
17. Plan for labor-intensive analyses
   Plan on conducting a piece-part FMEA (failure modes and effects analysis).
18. Create user-level documentation
   Plan on having a user's manual.
19. Plan on residual activity
   Any changes to configuration must be assessed for the impact on safety certification.
20. Publish a well-defined certification plan
   Document a certification plan.

Faults, Errors, and Failures
• Fault: a characteristic of an embedded system that could lead to a system error.
• Error: an unexpected and erroneous behavior of the system.
• Failure: an event in which the system does not perform its intended function.

Availability and Reliability
• Availability: a measure of how much the embedded system will be running and delivering the expected services.
• Reliability: the probability that an embedded system will deliver the requested services at a given point in time.
• Systems that are both highly reliable and highly available are said to be dependable.
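The two measures above can be made concrete with a short calculation. The sketch below is a minimal illustration in C, assuming the common textbook formula for steady-state availability, MTTF / (MTTF + MTTR), and estimating reliability as the fraction of service requests delivered correctly. Neither formula appears in the slides, and all function names are hypothetical.

```c
#include <assert.h>

/* Steady-state availability from mean time to failure (MTTF) and
 * mean time to repair (MTTR), both in hours.
 * Assumption: textbook formula, not taken from the slides. */
double availability(double mttf_hours, double mttr_hours)
{
    assert(mttf_hours > 0.0);
    assert(mttr_hours >= 0.0);
    return mttf_hours / (mttf_hours + mttr_hours);
}

/* Expected downtime per year (8760 hours) for a given availability. */
double downtime_hours_per_year(double avail)
{
    assert(avail >= 0.0);
    assert(avail <= 1.0);
    return (1.0 - avail) * 8760.0;
}

/* Reliability estimated from observation: the fraction of service
 * requests (demands) that were delivered without a failure.
 * Assumption: a simple on-demand estimate chosen for illustration. */
double reliability_on_demand(unsigned long demands, unsigned long failures)
{
    assert(demands > 0UL);
    assert(failures <= demands);
    return 1.0 - (double)failures / (double)demands;
}
```

For example, availability(999.0, 1.0) is about 0.999, which corresponds to roughly 8.76 hours of downtime per year.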
Fault Handling
• Fault avoidance: developing a system in a way that helps prevent the introduction of software and hardware faults.
• Fault tolerance: a layer of software that is able to “intercept” faults that occur in the system.
• Fault removal: either modifying the state of the system or removing the fault through debugging and testing.
• Fault prediction: the ability to predict a fault that may occur in the future and alert maintenance.

Hazard Analysis
• FMEA
• Fault Tree Analysis
• Event Tree Analysis

Risk analysis
• Method where each of the hazards identified is evaluated more carefully.
• First step: FMEA is used to make certain that the classification is correct.
• Values of rating: unacceptable, acceptable or tolerable.
• Redundancy

Is this architecture fit for a safety-critical system?

Safety Critical Architectures
• “Do-er” / “check-er”
• Two Processors
• Voter

Rules of Software Implementation
1. Restrict all code to very simple control flow constructs
2. Give all loops a fixed upper bound
3. Do not use dynamic memory allocation after initialization
4. No function should be longer than a single sheet of paper
5. Use at least two assertions per function
6. Declare all data objects at the smallest possible level of scope
7. Check the return value of non-void functions and check the parameters provided by the caller
8. Limit use of the preprocessor to inclusion of header files and simple macro definitions
9. Restrict the use of pointers
10. Compile all code from the first day of development, with all compiler warnings enabled

Software Implementation Strategies

Have a well-defined, repeatable peer-review process

Use existing safety coding standards
• MISRA C, 127 guidelines for using C in safety-critical applications
• Updated later to include 121 required rules and 20 advisory

Handle all combinations of input data
• Account for every combination of input value
• Check the external data that is coming into the system for all possible values

    if ( input_data_byte == 0 )
    {
        Movement = STOP;
    }
    else if ( input_data_byte == 1 )
    {
        Movement = GO;
    }
    else
    {
        Movement = STOP; // Most restrictive case here
        Log_Error( INP_DATA_BYTE_INV, "Unknown Value" );
    }

Specific variable value checking

    // Weak check: any value other than RELAY_CLOSED,
    // including a corrupted one, allows movement
    if ( relay_status != RELAY_CLOSED )
    {
        DO_Allow_Movement(); // Let the vehicle move, everything OK
    }
    else
    {
        DO_Stop(); // The relay isn't positioned correctly, stop!
    }

    // Explicit check: every expected value is tested,
    // anything else stops the vehicle and is logged
    if ( relay_status == RELAY_OPEN )
    {
        DO_Allow_Movement(); // Let the vehicle move, everything OK
    }
    else if ( relay_status == RELAY_CLOSED )
    {
        DO_Stop(); // It is closed, so we need to stop
    }
    else // This case shouldn't happen
    {
        DO_Stop(); // The relay isn't positioned correctly, stop!
        Log_Error( REL_DATA_BYTE_INV, "Unknown Value" );
    }

Mark safety-critical code sections

Timing execution checking
• Check that all intended software is able to run in a timely manner
• Make sure that all lower priority tasks are able to run
• Make sure that the clock rate of the entire system hasn’t slowed

Stale Data
• Make sure there is no stale data in the system
• Find a way to delete data once safety-critical code has generated its output
• Use sequence numbers in the case of serial data
• Use a CRC or other error check for large blocks of data

Comparison of outputs
• Cross-check outputs of safety-critical functions
• One processor running in parallel with another
• For a serial stream of data:
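The stale-data and output-comparison points above might be combined as follows. This is a sketch only: the channel_output_t layout, the cross_check name, and the 8-bit sequence counter are assumptions made for illustration, not something the slides specify. Movement is allowed only when both channels deliver a new sample with the same sequence number and both ask for GO; anything else falls back to the most restrictive output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MOVE_STOP = 0, MOVE_GO = 1 } movement_t;

/* One output sample produced by a processing channel (hypothetical layout). */
typedef struct {
    uint8_t    seq;      /* incremented by the producer for every new sample */
    movement_t command;  /* output computed by this channel */
} channel_output_t;

/* A sample is stale when its sequence number has not changed since the
 * last accepted sample. */
static bool is_stale(uint8_t seq, uint8_t last_seq)
{
    return seq == last_seq;
}

/* Cross-check two channels and return the command to act on.
 * last_seq holds the sequence number of the last accepted sample and is
 * updated only when a fresh, matching pair is accepted. */
movement_t cross_check(const channel_output_t *a,
                       const channel_output_t *b,
                       uint8_t *last_seq)
{
    assert(a != NULL && b != NULL);
    assert(last_seq != NULL);

    /* Channels out of step, or data already seen: most restrictive case. */
    if (a->seq != b->seq || is_stale(a->seq, *last_seq)) {
        return MOVE_STOP;
    }
    *last_seq = a->seq;

    /* Both channels must explicitly agree on GO. */
    if (a->command == MOVE_GO && b->command == MOVE_GO) {
        return MOVE_GO;
    }
    return MOVE_STOP;
}
```

Starting the counter at a value the producers will not send first (for example 255 when they start at 0) lets the very first sample pass the freshness check. A replayed sample with an unchanged sequence number is rejected even if both channels agree on its content.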
Initializing data to least permissive state
• Continuously make decisions on whether to allow any state to be more permissive than the least permissive
• “Safest Condition”
• Initialize the code in a safe state

Order of execution
• Have safety checks for code sections that run one after another
• Use sequence numbers for the tasks
• Use semaphores/flags
• Run tasks in time frames

Volatile data checking
• Integrity checks for offboard data using CRC
• The most useful parameter of a CRC is its Hamming distance
• Check all safety-critical volatile data
• Updated data: set to the least permissive state
• Non-updated data: check using CRC
• Check safety-critical data used throughout the code

Non-volatile data checking
• Calculate a CRC for the program image at build time
• Use multiple CRCs for various sections of the code space
• Make sure the low priority task doing the image check runs
• Check integrity of the image

Make sure the entire system can run
• An RTOS may become complicated for a safety-critical system
• More checking is needed for safety-critical tasks in an RTOS system
• Checking is easier in a simple scheduler system
• Use an external timing circuit that provides a reference
• Check whether the timing is real

Remove “dead” code
• Remove any code not currently being called
• Put conditional compiles around the block of code

    #if defined (LOGDEBUG)
    index = 20;
    LOG_Data_Set( *local_data, sizeof( data_set_t ));
    #endif

    #if defined (LOGDEBUG)
    #if !defined(DEBUG)
    neverCompile /* deliberately invalid: breaks the build if LOGDEBUG is set without DEBUG */
    #else
    index = 20;
    LOG_Data_Set( *local_data, sizeof( data_set_t ));
    #endif
    #endif

Fill unused memory
• Fill unused non-volatile memory with meaningful data
• Fill the memory with instructions that cause the processor to reset

Static code analysis
• Run a static code analyzer when code is compiled
• No warnings at the end of analysis

Aviation - SIFT
• Aircraft control computer system
• Life-threatening failure less than 10^-10 per hour in a 10 hour flight
• Replicated processors and adaptive voting
• Voting mechanism implemented entirely in software
• Verified in two stages

Monitor-Actuator Pattern

References
• J. Bowen, “Safety-critical systems, formal methods and standards”, Software Engineering Journal (Volume: 8, Issue: 4)
• G.J. Holzmann, “The power of 10: rules for developing safety-critical code”, Computer (Volume: 39, Issue: 6)
• http://link.springer.com/chapter/10.1007/978-3-642-336782_30

THANK YOU