Safety and Reliability of Embedded Systems

Safety Critical Software
Development
- Suparna Paruthy
Safety Requirements
• Customer interaction
• Similar product in same intended market
• Competitive intelligence
• Professional assistance
Certification Killers
• Unclear requirements
• Lack of clear evidence of compliance
• Not doing research up front
• Lack of dedicated resources
• Trying to safety certify too many things
• Not accounting for enough resources to document the safety case
• Not using a single contact to interface with the assessor
• Not being honest about the weaknesses of the proposed system
Project Planning Strategies
1. Determine the project certification scope early
Identify which standards your product
needs to meet.
2. Determine feasibility of certification
Answer up front whether the product and solution are
technically and commercially feasible.
3. Select an independent assessor
Find an assessor who has experience with your market
segment.
Project Planning Strategies (Contd.)
4. Understand your assessor’s role
The assessor’s job is to assess your product with respect to
compliance with standards and norms.
5. Assessment communication is key
Maintain a clear line of communication between your team
and the group controlling the standards.
6. Establish a basis of certification
Listing all of the standards and directives that your product
needs to comply with.
Project Planning Strategies (Contd.)
7. Establish a “fit and purpose” for your product
Establishing a fit and purpose up front will prevent future
headaches!
8. Establish a certification block diagram
Generating a hardware block diagram of the system.
9. Establish communication integrity objectives
Determine the “residual error” rate objectives for each
digital communication path.
10. Identify all interfaces along the certification boundary
Generating a boundary “Interface Control Document”.
Project Planning Strategies (Contd.)
11. Identify the key safety defensive strategies
Identify and document the safety defensive strategies used
to achieve the safety objectives for the program.
12. Define built in test (BIT) capability
Identifying the planned BIT coverage, including
initialization, periodic, conditional, and user-initiated.
13. Define fault annunciation coverage
Keeping the system and user interface in mind, define
which faults get annunciated.
14. Define reliance and expectation of the operator/user
Clearly defining any reliance that is placed on the operator
or user to keep the system safe.
Project Planning Strategies (Contd.)
15. Define plan for developing software to appropriate integrity
level
Address the compliance with each element of the
applicable standard you are certifying to.
16. Define artifacts to be used as evidence of compliance
Listing all of the documents and artifacts you plan to
produce as part of the system safety case.
17. Plan for labor-intensive analyses
Plan on conducting a piece-part FMEA (failure modes
and effects analysis).
Project Planning Strategies (Contd.)
18. Create user-level documentation
Plan on having a users’ manual.
19. Plan on residual activity
Any changes to configuration must be assessed for the
impact on safety certification.
20. Publish a well-defined certification plan
Document a certification plan.
Faults, Errors, and Failures
• Fault: a characteristic of an embedded system that could lead
to a system error.
• Error: an unexpected and erroneous behavior of the system.
• Failure: an event in which the system does not perform its
intended function.
Availability and Reliability
• Availability: a measure of how much of the time the embedded
system is running and able to deliver the expected services.
• Reliability: the probability that an embedded system will
deliver the requested services correctly over a specified
period of time.
• Systems that are both highly reliable and highly available are
said to be dependable.
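As a rough illustration of the availability measure, steady-state availability is commonly estimated as MTTF / (MTTF + MTTR). This formula and the function name `availability` are not from the slides; they are assumptions for the sketch below.

```c
/* Steady-state availability estimate (assumed formula, not from the slides):
 *   availability = MTTF / (MTTF + MTTR)
 * mttf_hours: mean time to failure, mttr_hours: mean time to repair.
 * Returns a value in [0, 1], or -1.0 if the inputs are unusable. */
double availability(double mttf_hours, double mttr_hours)
{
    if (mttf_hours < 0.0 || mttr_hours < 0.0 ||
        (mttf_hours + mttr_hours) <= 0.0) {
        return -1.0;
    }
    return mttf_hours / (mttf_hours + mttr_hours);
}
```

For example, a system that runs 999 hours between failures and needs 1 hour to repair would be available about 99.9% of the time under this estimate.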
Fault Handling
• Fault Avoidance: developing a system that helps prevent the
introduction of software and hardware faults
• Fault Tolerance: a layer of software that is able to “intercept”
faults that occur in the system.
• Fault Removal: consists of either modifying the state of the
system or removing the fault through debugging and testing.
• Fault prediction: ability to predict a fault that may occur in
the future and alerting maintenance.
Hazard Analysis
• FMEA
Hazard Analysis (Contd.)
• Fault Tree Analysis
Hazard Analysis (Contd.)
• Event Tree Analysis
Risk analysis
• Method where each of the hazards identified
is evaluated more carefully.
• First step: FMEA is used to make certain that
the classification is correct.
• Values of rating: unacceptable, acceptable or
tolerable.
• Redundancy
Is this architecture fit for a safety-critical
system?
Safety Critical Architectures
• “Do-er” / “check-er”
Safety Critical Architectures (Contd.)
• Two Processors
Safety Critical Architectures (Contd.)
• Voter
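A minimal sketch of what a voter might look like in software, assuming three redundant channels that each produce a 16-bit value and a two-out-of-three majority rule. The type `vote_result_t` and the function `vote_2oo3` are hypothetical names, and the slides do not say how the voter is actually built.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result of a two-out-of-three majority vote. */
typedef struct {
    bool     valid;  /* true when at least two channels agree */
    uint16_t value;  /* agreed value; meaningful only when valid */
} vote_result_t;

/* Compare three redundant channel outputs and return the majority value.
 * If no two channels agree, the result is marked invalid so the caller
 * can fall back to its safe state. */
vote_result_t vote_2oo3(uint16_t a, uint16_t b, uint16_t c)
{
    vote_result_t result = { false, 0u };

    if (a == b || a == c) {
        result.valid = true;
        result.value = a;
    } else if (b == c) {
        result.valid = true;
        result.value = b;
    }
    return result;
}
```

The caller is expected to treat an invalid vote as a fault and choose the most restrictive action.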
Rules of Software Implementation
1. Restrict all code to very simple control flow
constructs
2. Giving all loops a fixed upper bound
3. Not using dynamic memory allocation after
initialization
4. No function should be longer than a single
sheet of paper
5. At least two assertions per function
Rules of Implementation (Contd.)
6. Declaring all data objects at smallest possible
level of scope
7. Check return value of non-void functions and
check parameters provided by caller
8. Limit use of preprocessor to inclusion of
header files and simple macro definitions
9. Restricting use of pointers
10. Compiling all code with all warnings enabled
from the first day of development
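Several of these rules can be seen together in one small function. The sketch below is only an illustration: `average_samples` and `MAX_SAMPLES` are made-up names, and the limit of 16 samples is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SAMPLES 16u  /* assumed fixed upper bound for the loop */

/* Averages up to MAX_SAMPLES readings.
 * Returns 0 and writes *avg_out on success, -1 on invalid parameters. */
int average_samples(const uint16_t *samples, size_t count, uint16_t *avg_out)
{
    /* Rule 7: check the parameters provided by the caller. */
    if (samples == NULL || avg_out == NULL ||
        count == 0u || count > MAX_SAMPLES) {
        return -1;
    }

    uint32_t sum = 0u;  /* Rule 6: declared at the smallest scope needed */

    /* Rules 1 and 2: simple control flow, loop bounded by a constant. */
    for (size_t i = 0u; i < count && i < MAX_SAMPLES; i++) {
        sum += samples[i];
    }

    uint32_t avg = sum / (uint32_t)count;

    /* Rule 5: two assertions that should never fire if the code is right. */
    assert(sum <= (uint32_t)MAX_SAMPLES * UINT16_MAX);
    assert(avg <= UINT16_MAX);

    *avg_out = (uint16_t)avg;
    return 0;
}
```

Callers are expected to check the return value rather than assume success, which is the other half of rule 7.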
Software Implementation Strategies
Have a well-defined, repeatable peer-review
process
Using existing safety coding standards
• MISRA C, 127 guidelines for using C in safety-critical
applications
• Later updated (MISRA C:2004) to 121 required and 20 advisory rules
Handle all combinations of input data
• Account for every combination of input value
• Check the external data that is coming into the system for all
possible values
if ( input_data_byte == 0 )
{
    Movement = STOP;
}
else if ( input_data_byte == 1 )
{
    Movement = GO;
}
else
{
    Movement = STOP; // Most restrictive case here
    Log_Error( INP_DATA_BYTE_INV, "Unknown Value" );
}
Specific variable value checking
// Risky: any value other than RELAY_CLOSED, including a corrupted one,
// allows movement.
if ( relay_status != RELAY_CLOSED ) {
    DO_Allow_Movement(); // Let the vehicle move, everything OK
}
else {
    DO_Stop(); // The relay isn't positioned correctly, stop!
}
Specific variable value checking
(Contd.)
// Safer: test for each specific value and stop on anything unexpected.
if ( relay_status == RELAY_OPEN ) {
    DO_Allow_Movement(); // Let the vehicle move, everything OK
}
else if ( relay_status == RELAY_CLOSED ) {
    DO_Stop(); // It is closed, so we need to stop
}
else { // This case shouldn't happen
    DO_Stop(); // The relay isn't positioned correctly, stop!
    Log_Error( REL_DATA_BYTE_INV, "Unknown Value" );
}
Mark safety-critical code sections
Timing execution checking
• Checking that all intended software is able to run in a timely
manner
• Making sure that all lower priority tasks are able to run
• Making sure that the clock rate of the whole system hasn't
slowed
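One possible way to check that every task, including the lowest priority one, still gets to run is a check-in mask that a monitor inspects once per period. This is a sketch under assumptions: the three task IDs, `task_checkin`, and `monitor_period_ok` are hypothetical, and a real system would typically service a hardware watchdog only when the monitor reports success.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical task identifiers for this sketch. */
#define TASK_SENSOR   0u
#define TASK_CONTROL  1u
#define TASK_LOGGING  2u   /* lowest priority task */
#define TASK_COUNT    3u

#define ALL_TASKS_MASK ((1u << TASK_COUNT) - 1u)

static uint32_t checkin_mask = 0u;

/* Each task calls this once every time it completes a cycle. */
void task_checkin(unsigned task_id)
{
    if (task_id < TASK_COUNT) {
        checkin_mask |= (1u << task_id);
    }
}

/* Called once per monitoring period. Returns true only if every task
 * checked in since the previous call, then starts a new period. */
bool monitor_period_ok(void)
{
    bool all_ran = (checkin_mask == ALL_TASKS_MASK);
    checkin_mask = 0u;
    return all_ran;
}
```

If the monitor returns false, some task was starved or stalled during the period and the system should move to its safe state.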
Stale Data
• Making sure there is no stale data in the system
• Finding a way to delete data once safety critical code has
generated output
• Using sequence numbers in case of serial data
• CRC or error check for large blocks of data
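The sequence-number idea can be sketched as follows, assuming an 8-bit sequence counter that wraps around and a local tick counter for message age. All names here (`message_t`, `rx_state_t`, `message_is_fresh`) and the 50-tick age limit are assumptions made for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_AGE_TICKS 50u  /* assumed freshness limit */

typedef struct {
    uint8_t  seq;      /* sequence number carried in the message */
    uint32_t rx_tick;  /* local tick count when the message arrived */
    uint16_t payload;
} message_t;

typedef struct {
    bool    have_last; /* false until the first message is accepted */
    uint8_t last_seq;  /* sequence number of the last accepted message */
} rx_state_t;

/* Accept a message only if it is newer than the last accepted one
 * (repeats and old sequence numbers are rejected) and not too old. */
bool message_is_fresh(rx_state_t *st, const message_t *msg, uint32_t now_tick)
{
    if (st == NULL || msg == NULL) {
        return false;
    }

    /* Unsigned 8-bit subtraction handles counter wrap-around. */
    uint8_t delta  = (uint8_t)(msg->seq - st->last_seq);
    bool    seq_ok = !st->have_last || (delta != 0u && delta < 128u);
    bool    age_ok = (uint32_t)(now_tick - msg->rx_tick) <= MAX_AGE_TICKS;

    if (seq_ok && age_ok) {
        st->last_seq  = msg->seq;
        st->have_last = true;
        return true;
    }
    return false;
}
```

A rejected message leaves the receiver state untouched, so the caller can keep using its least permissive output until fresh data arrives.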
Comparison of outputs
• Cross-checking outputs of safety-critical functions
• One processor running in parallel with another
• For serial stream of data:
Comparison of outputs (Contd.)
Initializing data to least permissive
state
• Continuously making decisions on whether to allow any state
to be more permissive than the least permissive
• “Safest Condition”
• Initializing the code in safe state
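A small sketch of this idea, assuming a two-state movement decision like the one in the earlier relay example. The names `movement_t`, `conditions_t`, and `decide_movement` are hypothetical, as are the three conditions being checked.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { MOVEMENT_STOP = 0, MOVEMENT_GO = 1 } movement_t;

/* Hypothetical set of conditions that must all hold before moving. */
typedef struct {
    bool relay_open;
    bool inputs_valid;
    bool selftest_passed;
} conditions_t;

/* The result starts in the least permissive state and is only made
 * more permissive when every condition has been positively confirmed. */
movement_t decide_movement(const conditions_t *c)
{
    movement_t result = MOVEMENT_STOP;

    if (c != NULL && c->relay_open && c->inputs_valid && c->selftest_passed) {
        result = MOVEMENT_GO;
    }
    return result;
}
```

Because the default is the safest condition, any path that skips or fails a check ends in a stop rather than in movement.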
Order of execution
• Having safety checks for code sections running one after
another
• Using sequence number for the tasks
• Using semaphores/flags
• Running tasks in time frames
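One way to apply sequence numbers to tasks is to have each step announce itself and compare against the step that is expected next. The three step names and the function `step_enter` are assumptions for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical processing steps that must run in this order. */
typedef enum {
    STEP_READ_INPUTS   = 0,
    STEP_COMPUTE       = 1,
    STEP_WRITE_OUTPUTS = 2,
    STEP_COUNT         = 3
} step_t;

static uint8_t next_expected_step = STEP_READ_INPUTS;

/* Each step calls this on entry. Returns true only when the step runs
 * in the planned order. On a violation the sequence restarts from the
 * first step and false is returned so the caller can go to a safe state. */
bool step_enter(step_t step)
{
    if ((uint8_t)step != next_expected_step) {
        next_expected_step = STEP_READ_INPUTS;
        return false;
    }
    next_expected_step = (uint8_t)((next_expected_step + 1u) % STEP_COUNT);
    return true;
}
```

The same check could be built from flags or semaphores; a counter is used here only because it is the shortest to show.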
Volatile data checking
• Integrity checks for offboard data using CRC
• The most useful parameter of CRC is Hamming distance
• Checking all safety critical volatile data
• Updated data set to least permissive state
• Non-updated data checked using CRC
• Check safety-critical data used throughout the code.
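A CRC check over a block of safety-critical data could look like the sketch below. It assumes the CRC-16/CCITT-FALSE parameters (polynomial 0x1021, initial value 0xFFFF); the slides do not name a polynomial, and `protected_block_t`, `block_seal`, and `block_is_intact` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF,
 * no bit reflection, no final XOR. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;

    for (size_t i = 0u; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (unsigned bit = 0u; bit < 8u; bit++) {
            if ((crc & 0x8000u) != 0u) {
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Hypothetical block of safety-critical data stored with its CRC. */
typedef struct {
    uint8_t  bytes[8];
    uint16_t crc;
} protected_block_t;

/* Recompute and store the CRC after the data has been updated. */
void block_seal(protected_block_t *b)
{
    b->crc = crc16_ccitt(b->bytes, sizeof b->bytes);
}

/* Verify the stored CRC before the data is used. */
bool block_is_intact(const protected_block_t *b)
{
    return b->crc == crc16_ccitt(b->bytes, sizeof b->bytes);
}
```

Data that fails the check should be treated as invalid and replaced by its least permissive value.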
Non-volatile data checking
• Calculating a CRC for the program image at build time
• Using multiple CRCs for various sections of the code space
• Making sure low priority task doing image check runs
• Check integrity of the image
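Because the image check usually runs in a low priority task, it helps to compute the CRC a few bytes at a time. The sketch below assumes the common CRC-32 (reflected polynomial 0xEDB88320); `image_check_t` and the `image_check_*` functions are hypothetical names, and the expected CRC would come from the build step.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 update (reflected polynomial 0xEDB88320). */
uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    for (size_t i = 0u; i < len; i++) {
        crc ^= data[i];
        for (unsigned bit = 0u; bit < 8u; bit++) {
            if ((crc & 1u) != 0u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc = crc >> 1;
            }
        }
    }
    return crc;
}

/* Hypothetical state for checking a program image in small slices. */
typedef struct {
    const uint8_t *image;
    size_t         length;
    size_t         offset;
    uint32_t       crc;
} image_check_t;

void image_check_start(image_check_t *chk, const uint8_t *image, size_t length)
{
    chk->image  = image;
    chk->length = length;
    chk->offset = 0u;
    chk->crc    = 0xFFFFFFFFu;  /* standard CRC-32 initial value */
}

/* Process at most 'chunk' bytes; returns true once the image is covered. */
bool image_check_step(image_check_t *chk, size_t chunk)
{
    size_t remaining = chk->length - chk->offset;
    size_t n = (chunk < remaining) ? chunk : remaining;

    chk->crc = crc32_update(chk->crc, chk->image + chk->offset, n);
    chk->offset += n;
    return chk->offset == chk->length;
}

/* Final CRC-32 value, to be compared with the one stored at build time. */
uint32_t image_check_result(const image_check_t *chk)
{
    return chk->crc ^ 0xFFFFFFFFu;
}
```

A separate monitor still has to confirm that the task calling `image_check_step` actually runs, otherwise a starved task would hide a corrupted image.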
Make sure the entire system can run
• RTOS may become complicated for a safety critical system
• More checking needed for safety-critical tasks in RTOS system
• Easier checking in simple scheduler system
• External timing circuit that provides a reference
• Making sure that the system timing is accurate
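The comparison against an external timing reference might be sketched like this, assuming both the internal clock and the external circuit produce tick counts at the same nominal rate over the same measurement window. The function name `clock_rate_ok` and the percentage tolerance are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compare ticks counted from the internal clock with ticks counted from
 * an external reference over the same window. Returns true when the two
 * agree within tolerance_pct percent of the reference count. */
bool clock_rate_ok(uint32_t internal_ticks, uint32_t reference_ticks,
                   uint32_t tolerance_pct)
{
    if (reference_ticks == 0u) {
        return false;  /* the reference itself appears to have stopped */
    }

    uint64_t diff = (internal_ticks > reference_ticks)
                        ? (uint64_t)internal_ticks - reference_ticks
                        : (uint64_t)reference_ticks - internal_ticks;

    /* Integer form of: diff / reference <= tolerance_pct / 100 */
    return (diff * 100u) <= ((uint64_t)reference_ticks * tolerance_pct);
}
```

A failed comparison means either the system clock or the reference has drifted, and neither can be trusted until the cause is known.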
Remove “dead” code
• Remove any code not currently being called
• Putting conditional compiles around the block of code
#if defined (LOGDEBUG)
    index = 20;
    LOG_Data_Set( *local_data, sizeof( data_set_t ));
#endif
Remove “dead” code (Contd.)
#if defined (LOGDEBUG)
  #if !defined(DEBUG)
    neverCompile  // Deliberate compile error: LOGDEBUG is set in a non-debug build
  #else
    index = 20;
    LOG_Data_Set( *local_data, sizeof( data_set_t ));
  #endif
#endif
Fill unused memory
• Filling unused non-volatile memory with meaningful data
• Filling the memory with instructions that cause the processor
to reset
Static code analysis
• Running static code analyzer when code is compiled
• No warnings at the end of analysis
Aviation - SIFT
• Aircraft control computer system
• Life-threatening failure probability of less than 10^-10 per hour
in a 10-hour flight
• Uses replicated processors and adaptive voting
• Voting mechanism implemented entirely in software
• Verified in two stages
Monitor – Actuator Pattern
References
• J. Bowen, “Safety-critical systems, formal methods and
standards”, Software Engineering Journal (Volume: 8, Issue: 4)
• G.J. Holzmann, “The power of 10: rules for developing
safety-critical code”, Computer (Volume: 39, Issue: 6)
• http://link.springer.com/chapter/10.1007/978-3-642-336782_30
THANK
YOU