John Connelly
Exelon Generation
Engineering Manager – Capital Projects
Implementation of digital modifications is an industry-wide issue:
• IER 11-02 identifies an adverse trend in SCRAMS between 2005 and 2010
• 43 SCRAMS (35%) were the result of flawed implementation of Design Changes involving digital technology
INPO 10-008 examined events from 2003 to 2007
• 17 SCRAMS from software malfunctions resulted in a loss of 1.6 million MWh
• 24 SCRAMS from hardware malfunctions resulted in a loss of 3.1 million MWh
• Significant operational and safety challenges
• At a modest $50 / MWh, this lost generation represents an industry-wide cost of ~$200M
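The cost figure can be checked with a few lines of arithmetic. This is only a sketch using the two loss figures and the $50 / MWh price quoted above; the variable names are illustrative. The result is of the same order as the ~$200M quoted.

```python
# Lost generation quoted above from INPO 10-008 (events from 2003 to 2007).
software_loss_mwh = 1.6e6
hardware_loss_mwh = 3.1e6
price_per_mwh = 50.0  # the "modest" price assumed above, in dollars

total_loss_mwh = software_loss_mwh + hardware_loss_mwh
cost_dollars = total_loss_mwh * price_per_mwh

print(f"Total lost generation: {total_loss_mwh / 1e6:.1f} million MWh")  # 4.7
print(f"Estimated cost: ${cost_dollars / 1e6:.0f}M")                     # $235M
```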
• Flaws in the processes by which digital modifications are implemented
• Inadequate knowledge of the complex technologies and techniques common to nearly all digital modifications
Performance Objectives for the Design Change Process (CM.3) are under revision
Future INPO evaluations will include a review of the processes by which you manage the unique characteristics of digital technology.
This includes:
• Development and control of procurement specifications
• Software
• Vendor interfaces
• Testing
• Validation
• Failure Modes and Effects Analysis
Application of digital technology requires very different and specialized skills to implement correctly
INPO ACAD 98-04, Rev 2 introduces the entity of “Digital Engineer”
Engineers assigned to work independently on digital projects must be qualified to ACAD 98-04, Rev 2 by March 2013
Training evaluations conducted after March of 2013 will be in accordance with the requirements of ACAD 98-04, Rev 2
Digital technology, while superior in nearly every dimension to analog technology, requires very different competencies and processes:
• Software engineering
• Hardware design
• Exception / Fault / Error Handling / Recovery
• Networking
• Cyber Security
• Human Factors Engineering
• Advanced analysis techniques (FMEA / SHA / CDR)
• EMI / RFI
• Interfacing systems knowledge
• Plant Operations
• Testing / Dynamic response analysis
• Life-Cycle Management
• Engineering processes for “conventional” modifications do not, by themselves, provide an adequate defense against errors and events
• Requires very different skills to implement correctly
• Your design processes will be evaluated against this reality
Exelon Internal OPEX
A series of events beginning in 2005 made it clear that improvement opportunities existed
The Quad Cities Reactor Recirculation Adjustable Speed Drives (ASD) provide a representative example of the challenges
• Approximately 150 Issue Reports
• Manual scram, power reductions and operational challenges
Principal findings from CCA:
• Latent design flaws in vendor products
• FMEA did not detect design issues
• Excessive reliance on vendors
• Testing failed to uncover issues
Similar experiences with other modifications
Redesigning the process at Exelon
Formed Corporate Capital Projects Group to oversee large, multi-site digital modifications (RRASD, DEH, MPT, TDFWP, BOP 7300…)
Staffed with subject matter experts on digital technology
CPG works closely with implementing engineers at the sites who manage the EC development process
Advanced training provided to site and corporate digital I&C engineers to jump-start performance
Procedures and processes revised to capture best practices; process improvements will continue indefinitely as practices mature
Exelon Digital Modification Process
The existing Configuration Control process is now supplemented with procedures that address the unique attributes of digital technology
• Management Of Digital Modifications
• Digital Design Considerations
• Design Attributes For Digital Systems
• Software Development
• Digital Procurement Process
• Factory Acceptance Testing
• Cyber Security
The process continues to evolve as Cyber Security requirements are implemented and additional best practices are identified
Typical Project Lifecycle
Procurement Specifications
The act of fully defining detailed vendor requirements commensurate with project safety significance, operational risk and project scope.
Specifically identifying documentation and performance requirements for a given project including (but not limited to):
• Verification and Validation (V&V) requirements
• Software Quality Assurance measures
• Hardware design requirements (including Single Point Vulnerabilities)
• Failure Modes and Effects Analysis (FMEA) requirements
• Software testing and validation requirements
• Cyber Security requirements
• Life Cycle Management (LCM) requirements
Time invested in the development of a detailed procurement specification improves project execution by avoiding unbudgeted scope changes
No system will ever be perfect no matter how rigorous the development process used or amount of money spent to develop and maintain it – humans develop software and humans will always make mistakes
Highly automated systems effectively move the point of error from the user (Operations and Maintenance) to the programmer but human error still exists
The Space Shuttle flight control system was arguably the most rigorously developed and tested control system ever conceived
• 400,000 words (very small footprint compared to a modern DCS)
• $100,000,000 per year in maintenance
• Over the 25-year shuttle program, 16 Severity Level 1 (SL1) software issues were identified; SL1 issues are those that would result in the loss of the orbiter under the right conditions
Software malfunctions are systemic, not random
In the absence of a hardware-induced fault, instructions will execute exactly as written, unerringly and without exception
Software malfunctions require the simultaneous existence of two conditions:
• An error must be present (often undetected)
• An initiating event must occur
Unless both conditions are satisfied, no malfunction will occur
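The two conditions can be illustrated with a minimal sketch. The function below is hypothetical and not taken from any real system: it carries a latent error (no guard against a zero span) that stays invisible until the initiating event (a zero span value) occurs, after which it fails the same way every time.

```python
def scale_flow_signal(raw_counts: int, span_counts: int) -> float:
    """Convert raw transmitter counts to a 0-100 % reading (hypothetical)."""
    # Latent error: nothing guards against span_counts == 0.
    return 100.0 * raw_counts / span_counts


# Condition 1 only: the error is present, but no initiating event occurs.
print(scale_flow_signal(2048, 4096))  # 50.0

# Conditions 1 and 2 together: the same input produces the same
# malfunction on every attempt, which is what "not random" means here.
for attempt in range(3):
    try:
        scale_flow_signal(2048, 0)
    except ZeroDivisionError as exc:
        print(f"attempt {attempt}: malfunction ({exc})")
```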
A Representative Example From Aerospace
The Event:
• A completed commercial airliner is about to be delivered to the customer
• A Factory Acceptance Test is being conducted by factory and customer personnel in which the parking brakes are applied and all four engines are taken to maximum continuous thrust
• At this power setting and altitude (zero feet) the flight control system automatically selects “takeoff” mode as designed
• The flight control system correctly recognizes that the wing surfaces are incorrectly configured for a takeoff and continuously sounds the Ground Proximity Warning (GPW) alarm as designed; this alarm is critical and cannot be silenced
• A technician, irritated by the alarm and unable to silence it, trips the feed breaker for the GPW system knowing that this will de-energize the alarm
• Ground proximity radar loses power and clears the zero altitude interlock
• With the interlock cleared, control system now concludes the plane is in the air and releases the brakes – this is a programmed behavior to prevent landing the aircraft with the brakes set
• Plane immediately accelerates (no passengers or luggage and little fuel) and strikes the jet blast barrier at full power
A Representative Example From Aerospace
• The error must be present and undetected:
This application software had been in service for years and “ground run-up” tests are somewhat routine
• The initiating event must occur:
The loss of supply voltage to GPW interlock caused the brakes to release exactly as they were programmed to do.
The software development team never envisioned this combination of events
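The interlock behavior described in this example can be sketched in a few lines. This is not the aircraft's actual software: the function names are hypothetical, a de-energized radar is modeled as `None`, and holding the last known state is only one possible defensive choice, shown for contrast.

```python
from typing import Optional


def on_ground_flawed(radar_altitude_ft: Optional[float]) -> bool:
    """Flawed sketch: a de-energized radar (None) reads as 'not on the ground'."""
    return radar_altitude_ft is not None and radar_altitude_ft <= 0.0


def on_ground_defensive(radar_altitude_ft: Optional[float], last_known: bool) -> bool:
    """Defensive sketch (an assumption): loss of signal holds the last known state."""
    if radar_altitude_ft is None:
        return last_known
    return radar_altitude_ft <= 0.0


def brakes_released(on_ground: bool, takeoff_mode: bool) -> bool:
    """Programmed behavior from the example: release the brakes once airborne."""
    return takeoff_mode and not on_ground


# Ground run-up with the radar powered: brakes stay set.
print(brakes_released(on_ground_flawed(0.0), takeoff_mode=True))   # False
# Breaker tripped: the flawed logic concludes the plane is airborne.
print(brakes_released(on_ground_flawed(None), takeoff_mode=True))  # True
# Same event with the defensive variant: brakes stay set.
print(brakes_released(on_ground_defensive(None, last_known=True), takeoff_mode=True))  # False
```

Each function behaves exactly as written. The malfunction comes from a combination of inputs that was never considered during design.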
Changes can invalidate previous testing or introduce new errors
Integration With Cyber Security Requirements
The Cyber Security Rule (10 CFR 73.54) is a license condition that applies to any digital component that is:
• Safety Related
• Important To Safety (defined as reactivity impact)
• Physical Security
• Emergency Preparedness
• Systems that support any of the above
• Systems with pathways of connectivity to any of the above
Significant synergies exist between the Digital I&C process and Cyber Security
Consider the extent to which these processes are interconnected and aware of each other
Factory Acceptance Testing
Many test plans focus on “positive testing” which confirms expected responses for a given set of inputs or stimulus conditions – informative but only to a point
Negative testing focuses on verifying that you don’t get an unexpected response when you combine unusual stimuli or do something outside of normal operation; effectively it’s an attempt to trigger a malfunction, which can be very informative
It’s nearly inevitable that over the life of a system, it will be operated in a way the designers never anticipated. Take advantage of unstructured testing opportunities (i.e. pre-FAT) to attempt to “break” the system early in the development cycle while there is ample opportunity to take corrective action for issues identified
Process needs to involve Operators, System Engineers and SMEs
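The difference between positive and negative testing can be sketched as follows. The function, limits and test values are hypothetical and chosen only for illustration. The function under test accepts a new demand value or holds the current one, and the negative tests deliberately feed it inputs outside normal operation.

```python
import math

LOW_LIMIT, HIGH_LIMIT = 0.0, 100.0  # hypothetical demand range, in percent


def accept_demand(value, current: float) -> float:
    """Return the new demand, or hold the current one if the input is unusable."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return current  # wrong type (including booleans): hold
    if math.isnan(value) or math.isinf(value):
        return current  # not a finite number: hold
    if not LOW_LIMIT <= value <= HIGH_LIMIT:
        return current  # out of range: hold
    return float(value)


# Positive test: an expected input gives the expected response.
assert accept_demand(55.0, current=50.0) == 55.0

# Negative tests: unusual stimuli must not produce an unexpected response.
for bad in (float("nan"), float("inf"), -1.0, 100.1, "55", None, True):
    assert accept_demand(bad, current=50.0) == 50.0

print("all checks passed")
```

The list of bad inputs is where Operators, System Engineers and SMEs add the most value, since they know how the system is likely to be misused in practice.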
Modification Acceptance Testing
Most modification issues are not with the systems themselves but rather interfaces to installed plant hardware (power / hydraulics / supporting systems / actuators / protective devices / EMI / RFI…)
The Mod Acceptance Test (MAT) is the very first time the system will be tested in the plant environment. In some cases it will be the first time that the system is connected to any physical components, and it therefore represents the first opportunity to identify and correct interface issues. Care should be taken to exercise every interface to the extent possible and as early as possible
All models are wrong, including your plant simulator and vendor simulation models; in-plant testing is therefore critical and your most robust line of defense
Ongoing Configuration Control
One of the advantages of digital systems is that they are easily modifiable – this also constitutes a vulnerability if not taken into consideration by the process
• Processes need to exist to detect any inadvertent changes to a system’s configuration
• “Baseline / Compare” utilities can be used to compare system states with a known and approved baseline configuration
• Periodic audits of log, system and event files
• Surveillance testing
• Defined protocols for testing of authorized modifications (i.e. regression testing)
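A “Baseline / Compare” check can be sketched as follows. This is a minimal illustration, not a description of any vendor utility: it assumes the configuration is a directory of files and fingerprints each file with SHA-256. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(config_dir: Path) -> dict[str, str]:
    """Map each file's relative path to the SHA-256 digest of its contents."""
    return {
        p.relative_to(config_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(config_dir.rglob("*"))
        if p.is_file()
    }


def compare(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Report files added, removed or changed relative to the approved baseline."""
    common = baseline.keys() & current.keys()
    return {
        "added": sorted(current.keys() - baseline.keys()),
        "removed": sorted(baseline.keys() - current.keys()),
        "changed": sorted(k for k in common if baseline[k] != current[k]),
    }
```

In use, the snapshot taken when the configuration is approved would be stored under configuration control. A later snapshot is then passed to `compare`, and any non-empty list in the result is a change to be reconciled against authorized work.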
Not all changes are modifications
• Changes to calibration constants controlled in accordance with maintenance procedures
• Pre-evaluated adjustments (tuning within defined boundaries)
• Specific changes for Cyber Security incident response in accordance with CS procedures
Reference EPRI Topical Report 1022991, “Guideline On Configuration Management For Digital Instrumentation And Control Equipment And Systems”