STEREO FAULT PROTECTION CHALLENGES AND LESSONS LEARNED P STEREO Fault Protection Challenges and Lessons Learned George J. Cancro and Michael D. Trela erforming the fault protection task for the Solar TErrestrial RElations Observatory (STEREO) spacecraft forced the design team to deal with multiple issues that are either not emphasized or not encountered on single-spacecraft missions. The decisions made during STEREO development regarding redundancy, test philosophy, test execution, staffing, and data handling were directly impacted by the number of spacecraft. Looking back over the STEREO development, several lessons learned emerged in the area of fault protection for the design, testing, and operations of multispacecraft missions. This paper captures and documents seven lessons along with rationale and examples from the STEREO development in the hope that future multispacecraft missions can benefit from this information. INTRODUCTION Fault protection is defined as “the use of cooperative design of flight and ground elements to detect and respond to perceived spacecraft faults.”1 The goal of fault protection is to achieve mission reliability objectives without violating program resources. Fault protection must achieve this goal by balancing project risk and the cost of developing, testing, and operating proposed fault detectors and remedies. With multispacecraft missions, such as the twin Solar TErrestrial RElations Observatory (STEREO) spacecraft, the goal of fault protection is further complicated. Design, JOHNS HOPKINS APL TECHNICAL DIGEST, VOLUME 28, NUMBER 2 (2009) testing, and operation decisions regarding redundancy, test philosophy, and operations staffing, for example, are directly impacted by the number of spacecraft. Performing the fault protection task for STEREO forced the design team to deal with multiple issues that are either not emphasized or not encountered on single-spacecraft missions. During performance of the fault protection task for the STEREO spacecraft, several lessons emerged that will be beneficial to future multispacecraft missions. This paper discusses these lessons, which are divided into three sections based on three phases of develop- 155­­­­ G. J. CANCRO and M. D. TRELA ment: design, testing, and operation. For each phase, a brief summary of fault protection duties is provided. Following the summary, two to three detailed lessons from that phase are discussed. For each lesson, a high-level statement is made, and examples from STEREO are provided as the rationale for advocating the lesson. DESIGNING TWO OF A KIND Effects of Combined Science Each STEREO spacecraft has 11 separate instruments in three different instrument suites. No hardware redundancy was allocated to either spacecraft because of project resource constraints; therefore, all instruments were single string. In single-spacecraft missions, science instruments can exhibit functional redundancy to the science objectives (i.e., the observations from one instrument or another can achieve the same level 1 science objective). Because of the large number of instruments and the desire for combined science (i.e., to combine observations of similar instruments on both spacecraft simultaneously) on STEREO, an instrument-redundancy matrix comparing all science instruments on both spacecraft to the science objectives was generated. A section of this matrix is shown in Fig. 1. For each set of science objectives, multiple instrument redundancy options were generated to achieve the science objective. The table exposes a twofold increase in functional redundancy with two spacecraft. For example, with STEREO, science objective 1A could be met with the COR1, EUVI, or S/WAVES instruments on either spacecraft. In addition, single-string Science Objectives Redundancy Options Fault protection during the design phase of a program consists of analyzing potential faults and developing the architecture that can deal best with these faults throughout the entire mission. Fault protection architecture includes the organization, design, and interrelationships of hardware, software, and operations procedures to detect and respond to perceived spacecraft faults. At this phase, decisions about hardware redundancy and critical sequence design must be made to maximize reliability within project resources. For the STEREO program, developing and launching two spacecraft with low project resource levels (both mass and dollars) forced early designs toward singlestring redundancy schemes. To increase overall reliability under these conditions, functional and selective redundancies were applied. The following sections discuss the lessons learned in applying functional and selective redundancy on the STEREO program, lessons that can apply to any future multispacecraft mission. Figure 1. Portion of the STEREO instrument-redundancy matrix showing how combinations of instruments from both spacecraft can meet science objectives. O, required for 240 days; X, required for 2 years. Sun–Earth Connection Coronal and Heliospheric Investigation (SECCHI) instruments: COR1 and -2, coronagraphs 1 and 2; EUVI, extreme ultraviolet imaging telescope; HI1 and -2, heliospheric imagers 1 and 2. S/WAVES, STEREO/WAVES. In situ Measurements of PArticles and CME Transients (IMPACT) instruments: MAG, magnetometer; SWEA, solar wind electron analyzer; STE-U/D, suprathermal electron telescope (upstream and downstream sensor heads); SEP, solar electron proton telescope; PLASTIC, PLAsma and SupraThermal Ion Composition Investigation. 156 JOHNS HOPKINS APL TECHNICAL DIGEST, VOLUME 28, NUMBER 2 (2009) STEREO FAULT PROTECTION CHALLENGES AND LESSONS LEARNED instruments of both spacecraft form a physical redundancy to the science objectives. This is shown in the first two rows of objectives F and G, where the SEP instrument from Spacecraft A or Spacecraft B can be used to meet the objectives. Using functional and physical redundancy across all spacecraft in a multispacecraft mission can increase the reliability of the mission and may enable reduction in cross-strapping and redundancy on individual spacecraft. Exploit the presence of multiple instruments on multiple spacecraft in combined science missions to increase overall mission reliability Multiple Systems Tied Together Developing a redundancy plan for the STEREO spacecraft involved developing a balance between reliability and cost. The cost of the STEREO mission precluded full redundancy; therefore, the fault protection plan used selective redundancy to achieve the greatest reliability for available resources. Deciding what to make redundant in a selectiveredundancy scheme is a difficult issue. For multispacecraft missions, the first essential place to allocate limited redundancy is to any single fault that can eliminate both spacecraft. Because STEREO launched with both observatories stacked upon each other, failure to separate the stack from the launch vehicle and failure to separate the spacecraft from each other are two prime examples of failures that eliminate both spacecraft. To mitigate these failures, a reliability block diagram of the separation system was developed to ensure that the detection of stack separation and the initiation of spacecraft separation were free of single-point failures, as shown in Fig. 2. Redundancy was placed within the STEREO power subsystem to ensure that the detection and initiation contained no single-point failures. To mitigate against failure to separate the stack, requirements Sep Switch 2 Lesson 2 Focus fault architecture and redundancy on single faults that can eliminate more than one spacecraft. TESTING TWO OF A KIND Lesson 1 Sep Switch 1 were added to allow the bottom spacecraft (Spacecraft B) to release Spacecraft A upon command from the ground. Cmd/ Dec A Cmd/ Dec B Fuse Fuse Safe Relay A Safe Relay B Pyro Enable Relay A Pyro Enable Relay B Arm Plug Arm Plug Fault protection during the testing phase is ensured by challenging the spacecraft at a system level to verify that the spacecraft can survive and continue the mission in light of one or more failures. This activity includes test planning, test execution, and test review. For STEREO, testing two spacecraft provided some benefits for and some issues with the fault protection effort. The following sections discuss the lessons learned during the testing phase of the STEREO program. Multispacecraft Test Planning Testing two satellites that were nearly identical posed interesting questions regarding cost and schedule savings versus risk reduction with robust spacecraft testing. The type and amount of testing that each satellite would endure had to be determined. Several space missions have performed testing programs with more manufacturing-like methods, in which a full set of system and environmental tests was performed on the first few spacecraft and then a small subset of these tests was run on later production models to demonstrate basic functionality or workmanship. A good example of this was the Globalstar test program.2 The STEREO program selected a test philosophy called “core and non-core.” Core tests were developed to test the base level of fault protection that could keep the spacecraft safe until the next available ground contract. These tests covered the “bootstrap” functionality of the STEREO fault protection system designed to protect the spacecraft if all other responses failed to FET Control FPGA FET Control FPGA Pyro Arm FET A Pyro Arm FET B Pyro 1 Fire FET A NSI 1A Pyro 2 Fire FET A NSI 1B Pyro 1 Fire FET B NSI 2A Pyro 2 Fire FET B NSI 2B Sep Band Cutter 1 Battery (10 for 11) Sep Band Cutter 2 Figure 2. Reliability block diagram for separation of the two STEREO observatories. Cmd/Dec, command decoder; FET, field effect transistor; FPGA, field-programmable gate array; NSI, NASA Standard Initiator; Sep, separation. JOHNS HOPKINS APL TECHNICAL DIGEST, VOLUME 28, NUMBER 2 (2009) 157­­­­ G. J. CANCRO and M. D. TRELA 158 Lesson 3 Develop a test philosophy early that not only discusses which test will be executed on which spacecraft but also plans for when tests fail and determines the regression test suite of the most critical fault scenarios. Multispacecraft Test Execution/Review The core and non-core philosophy was chosen because both spacecraft were being assembled simultaneously in I&T. A serial I&T would have been more conducive to a manufacturing-like approach. The parallel STEREO I&T, however, enabled the core and non-core test plan to use both spacecraft as test platforms, resulting in a twofold increase in test time available to the fault protection team. The parallel I&T also enabled the fault protection test team to avoid schedule hits associated with down time or periods when boxes had to be removed from the spacecraft. Non-core tests could be moved from one spacecraft to the next without slowing the schedule. By the end of I&T, all 31 planned fault protection tests were executed as described in the original test plan. Considering that the 10 core regression tests were repeated three times on both spacecraft, all 21 non-core tests were executed on either spacecraft, and several tests had to be repeated because they did not achieve all test objectives on the first execution on the spacecraft, a total of 113 tests were executed on the STEREO program. Per our test-review plan, all of the 113 test runs had to be reviewed to ensure that all objectives were met, all problems and failures were understood, and any behavior outside of the test plan was documented and understood. Because each test took from 4 to 8 h to execute and each STEREO spacecraft produces approximately 9 million telemetry values per hour, >4 billion telemetry values had to be reviewed. Faced with this daunting task, the STEREO fault protection team developed a new telemetry review tool called the STEREO Autonomy Visualizer (SAV)3 that enabled the team to examine up to 2500 telemetry values simultaneously and select and plot 100 telemetry points 30 min before and after the current review time. The user controls the current time in a TiVo-like manner, providing real-time random access to all the telemetry generated during the test. The effect of the addition of the tool into the test-review effort was significant, as demonstrated in Fig. 3. 60 Burn-Down Plot of System-Level Test Reports Completed 50 Test Reports Remaining remedy a problem. The non-core tests represented all other of the required fault protection functionality. The core test set had to be run on both spacecraft as a prerequisite for launch and also formed the regression suite. The non-core test set could be run on either spacecraft and did not have to be repeated if run successfully on one spacecraft. A total of 31 fault protection tests were generated (10 core and 21 non-core). The core and non-core philosophy provided efficiency and flexibility, as described in the next section, by being able to take full advantage of the parallel STEREO integration and testing (I&T) program; however, the philosophy was not clear on the next step to take after a non-core test failed. Early in I&T, a non-core test involving a fault of the spacecraft onboard oscillator (time source) failed to correct the problem resulting in the spacecraft becoming nonresponsive. The root cause, determined to be a software bug, was fixed, and the test was rerun successfully. Because the test was noncore and succeeded, the test was never run again. However, later on in the program, the hardware simulator team used the fault protection core and non-core tests to validate the simulator. During the simulator validation, the oscillator test failed again. The root cause was determined to be later software releases that exposed the original flaw. Fault protection regression had been executed with the new software release, but because the oscillator test was non-core, it was not run as part of the regression. The lesson here is a difficult one. The test was classified as non-core because an operational work-around existed; however, one can argue that the nonresponsive failure should have resulted in the addition of the oscillator test to the core regression suite. Either way, the test philosophy needs to be carefully developed for the effects and risks of test failure, and regression should be developed and communicated early. 40 30 Test reports remaining Rate of completion prior to SAV Rate of completion after SAV 20 10 11/30/05 12/20/05 01/09/06 01/29/06 02/18/06 03/10/06 03/30/06 Figure 3. Test-review improvement with SAV. JOHNS HOPKINS APL TECHNICAL DIGEST, VOLUME 28, NUMBER 2 (2009) STEREO FAULT PROTECTION CHALLENGES AND LESSONS LEARNED Take advantage of multiple spacecraft to achieve an n-fold increase in test time but also plan methods or automation to analyze n-fold more test data. OPERATING TWO OF A KIND Fault protection in the operations phase means engineering support is provided during critical events, both for diagnosis of on-orbit anomalies to determine failure and root cause and for maintaining and updating onboard fault protection routines. For STEREO, performing fault protection during the operations phases revealed our poor assumptions that single teams and common implementations are sufficient for performing fault protection for multispacecraft missions. At the same time, however, flying multiples enhanced our ability to determine failure. The following sections discuss the lessons learned during the operations phase of the STEREO mission. Issues with Staffing Up until launch, the STEREO fault protection team was sized as if one spacecraft was under development, not two. With good planning and automation, the smaller team was able to succeed in accomplishing all objectives from design to testing. Nominally, it was thought that one individual would be responsible for monitoring the current state of the autonomy system on both spacecraft during launch. However, as the team prepared for launch, the idea that different faults could occur simultaneously Spacecraft A HGA deploy Spacecraft B HGA deploy on both spacecraft drove an increase in staff to support critical operations. Before, during, and after the launch, there were numerous activities that demanded the attention of the fault protection team. However, it was found that the long time span in which engineers were required to monitor and support the launch preparation and postlaunch activities led to the creation of three shifts: prelaunch (Launch – 13 h until Launch – 5 h), launch (Launch – 5 h until Launch + 7 h), and postlaunch (Launch + 7 h until Launch + 19 h). More importantly, the team decided that the work required for each spacecraft necessitated staffing one engineer per spacecraft on each shift. Both prelaunch and postlaunch activities are similar for both spacecraft but may or may not be synchronous in time; therefore, a single engineer responsible for both spacecraft might become distracted or unaware of the current state of one of the spacecraft. For this reason, it seemed best that one engineer remain focused on each spacecraft. Lesson 5 Staff launch and simultaneous, critical events with a dedicated engineer for each spacecraft on each shift. Advantage of Operating Multiples One great advantage to flying multiple identical spacecraft is the ability to compare data from the similar operations of different spacecraft to gain insight on all spacecraft. A perfect example of this was the high-gain antenna (HGA) deployment as shown in Fig. 4. CE_ANG_MOM_Z CE_MOM_COLOR CE_PID_CONTROL CE_WHL_TORQ_SAT HGA_DPLY_TT_CDA_ ME_HGADPLY_ARM_ CE_ANG_MOM_Z CE_MOM_COLOR CE_PID_CONTROL CE_WHL_TORQ_SAT HGA_DPLY_TT_CDA_ ME_HGADPLY_ARM_ HGA “fully deployed” indicator Lesson 4 Figure 4. HGA deployment on both spacecraft shows how STEREO A has an indicator failure and not a failed deployment. JOHNS HOPKINS APL TECHNICAL DIGEST, VOLUME 28, NUMBER 2 (2009) 159­­­­ G. J. CANCRO and M. D. TRELA STEREO B deployed its HGA during the postlaunch period. The deployment was detected by hardware switches indicating the release of the dish and achievement of the fully deployed state. The deployment can also be detected in the guidance and control states of the spacecraft, where the deployment induces a rotation in the spacecraft body, an increase in overall momentum, and a measured response in the onboard reaction wheels to control the momentum increase, as shown in the plots at the top of Fig. 4. When STEREO A attempted the deployment of its HGA, the “fully deployed” hardware switches indicated that the HGA had not fully deployed. The reading of this switch caused alarm because it was uncertain whether the switch was providing a false-positive or a false-negative reading. This issue was solved by comparing the body rate data of each spacecraft at the time of the HGA deployment. The signature of the body rates and other G&C states were similar on both spacecraft, however, indicating that the HGA had fully deployed. Using the data from both spacecraft simultaneously enabled engineers to deduce that one of the switches on Spacecraft A had failed. Had the data from the second spacecraft not been readily available, it is likely that the operations team would have delayed using the HGA for a significant period of time until a series of checkout tests could be performed to verify whether the deployment had been successful. The ability to compare data plots across spacecraft provides tremendous additional information. To use this information, mechanisms must be in place to rapidly make the comparison. In addition, the mechanisms must have the ability to manipulate the telemetry to align events rather than examine the telemetry based on time, because even identical spacecraft cannot conduct operations synchronously. In the HGA example, the time between events was >30 min. The fault protection team had to align the G&C telemetry responses to properly interpret the event. Lesson 6 Use data from one spacecraft in a set to diagnose a fault on another spacecraft in the set. Deviation from Commonality All efforts to maintain commonality between multiple spacecraft missions can be easily disrupted following the first on-orbit failure. For STEREO, commonality was broken 6 months after launch with the failure of the inertial measurement unit (IMU) on Spacecraft A. The IMU failure on Spacecraft A forced the spacecraft to operate on the backup IMU while Spacecraft B 160 continued to operate on the prime IMU. Because the STEREO fault protection team did not fully understand the mechanism causing the IMU failure, the team was forced to devise a way for Spacecraft A to survive a second IMU fault. Before the solution was loaded to Spacecraft A, the team considered whether the same solution should be loaded to Spacecraft B strictly for the purposes of maintaining commonality. In the end, the team found that it was beneficial to keep the configuration as similar as possible on both spacecraft because the new changes did not add any significant risk to Spacecraft B. This policy of maintaining commonality is the safest for multiple spacecraft because “noncommon” spacecraft lead to “noncommon” operations, during which operator error could be more common because an operations caveat is forgotten or applied to the wrong spacecraft. However, it is evident that maintaining commonality may not always be possible; if a situation requiring significant modifications to the normal operating state of one spacecraft arises, the risk associated with deploying the new operating state on the other spacecraft could be greater than the risk of managing two different spacecraft configurations. Preparing for deviations involves not only developing philosophies on the steps to take when onorbit failures occur on one of the spacecraft but also evaluating tools, parameters, and documentation. For example, STEREO maintains documentation of the onboard autonomy system on a website to enable fault protection, systems, and operations staff to understand detections and reactions to faults. However, this single website was never designed to differentiate between Spacecraft A and Spacecraft B. This issue illustrates why preparing for deviations early in the design phase is most prudent. Lesson 7 Departure from commonality during operations is inevitable. CONCLUSIONS The design, testing, and operation of the twin STEREO spacecraft have yielded several interesting lessons for the development and execution of future fault protection efforts. As NASA moves forward, we expect more and more missions to contain multiple spacecraft. Consideration of these lessons early in any future effort will be beneficial to the program and ensure that future fault protection teams are adequately prepared for the differences between single-spacecraft and multispacecraft missions. JOHNS HOPKINS APL TECHNICAL DIGEST, VOLUME 28, NUMBER 2 (2009) STEREO FAULT PROTECTION CHALLENGES AND LESSONS LEARNED REFERENCES 1National Aeronautics and Space Administration, Fault Protection, NASA Preferred Reliability Practices, Practice Number PD-EC-1243 (Oct 1995). 2Costabile, V., Discepoli, A., Fiorentino, C., and Morelli, G., “New Spacecraft Assembly, Integration and Test Approach for Globalstar Satellite Constellation,” in Proc. 16th International Communications Satellite Systems Conf., American Institute of Aeronautics and Astronautics (AIAA), Washington, DC, pp. 1320–1330 (1996). 3Cancro, G., Turner, R., Nguyen, L., Li, A., Sibol, D., Gersh, J., Piatko, C., Montemayor, J., and McKerracher, P., “Innovative Interactive Visualization System for Analyzing Spacecraft Telemetry,” in Proc. IEEE Aerospace Conf., Big Sky, MT, 10.1109/AERO.2007.352952 (2007). The Authors George J. Cancro is the Assistant Group Supervisor of the Embedded Applications Group in the Space Department. He holds a B.S. in engineering science from The Pennsylvania State University and an M.S. in mechanical engineering– astronautics from the George Washington University. Before joining APL in 2002, he worked at the NASA Jet Propulsion Laboratory and NASA Langley Research Center on projects such as Mars Global Surveyor Aerobraking and the Dawn Mission to Vesta and Ceres. Since joining APL, he has worked as a System Engineer on the MESSENGER (Mercury Surface, Space Environment, Geochemistry, and Ranging), New Horizons, and STEREO missions, as a Project Manager for the NASA Small Satellite (SmallSat) project, and as a Principal Investigator of two research projects in the areas of autonomy and telemetry visualization. He currently is Principal Investigator of a research project investigating spacecraft tactical commanding and an advisor to the NASA Constellation program in the area of fault detection, isolation, and recovery. His areas of interest include modular software, hardware/software architectures, fault protection, and spacecraft autonomy. Michael D. Trela has been a member of the Space Systems Applications Group at APL since 2005. He received a B.S. in aerospace engineering from the University of Notre Dame in 2004 and an M.S. in aeronautics and astronautics from Stanford University in 2006. He has worked as a Fault Protection Engineer on numerous APL missions, including STEREO, MESSENGER, New Horizons, and Solar Probe. He also works as a Propulsion Analyst for the MESSENGER program. He currently is the Principal Investigator of a research project to improve systems engineering methods for verification, validation, and testing. For further information on the work reported here, George J. Cancro Michael D. Trela contact George Cancro. His email address is george.cancro@jhuapl.edu. JOHNS HOPKINS APL TECHNICAL DIGEST, VOLUME 28, NUMBER 2 (2009) 161­­­­